
Distributed Multi-Agent Reinforcement Learning Based on Graph-Induced Local Value Functions

Gangshan Jing, He Bai, Jemin George, Aranya Chakrabortty, and Piyush K. Sharma. G. Jing is with the School of Automation, Chongqing University, Chongqing 400044, China, jinggangshan@cqu.edu.cn. H. Bai is with Oklahoma State University, Stillwater, OK 74078, USA, he.bai@okstate.edu. J. George and P. Sharma are with the U.S. Army Research Laboratory, Adelphi, MD 20783, USA, {jemin.george,piyush.k.sharma}.civ@army.mil. A. Chakrabortty is with North Carolina State University, Raleigh, NC 27695, USA, achakra2@ncsu.edu.
Abstract

Achieving distributed reinforcement learning (RL) for large-scale cooperative multi-agent systems (MASs) is challenging because: (i) each agent has access to only limited information; (ii) scalability and sample-efficiency issues arise due to the curse of dimensionality. In this paper, we propose a general distributed framework for sample-efficient cooperative multi-agent reinforcement learning (MARL) by utilizing the structures of the graphs involved in this problem. We introduce three coupling graphs describing three types of inter-agent couplings in MARL, namely, the state graph, the observation graph, and the reward graph. By further considering a communication graph, we propose two distributed RL approaches based on local value functions derived from the coupling graphs. The first approach reduces the sample complexity significantly under specific conditions on the aforementioned four graphs. The second approach provides an approximate solution and can be efficient even for problems with dense coupling graphs; here there is a trade-off between minimizing the approximation error and reducing the computational complexity. Simulations show that our RL algorithms have significantly improved scalability to large-scale MASs compared with centralized and consensus-based distributed RL algorithms.

Index Terms:
Reinforcement learning, distributed learning, optimal control, Markov decision process, multi-agent systems

I Introduction

Reinforcement Learning (RL) [1] aims to find an optimal policy for an agent to accomplish a specific task by making the agent interact with the environment. Although RL has found wide practical applications such as board games [2], robotics [3], and power systems [4], the problem is much more complex for multi-agent reinforcement learning (MARL) due to the non-stationary environment faced by each agent and the curse of dimensionality. MARL has therefore attracted increasing attention recently and has been studied extensively; see the survey papers [5, 6, 7]. Nonetheless, many challenges remain. In this paper, we focus on two difficulties in developing distributed cooperative RL algorithms for large-scale networked multi-agent systems (MASs): (i) how to deal with inter-agent couplings across the network when only local information is observable to each agent; and (ii) how to guarantee scalability of the designed RL algorithm to large-scale network systems.

The first challenge refers to the fact that an agent cannot make its decision independently, since it affects and is affected by other agents in different ways. Three main types of structure constraints causing inter-agent couplings in MARL have been considered in the literature (there have been many different settings for the MARL problem; our work aims to find a distributed control policy for a dynamic MAS to cooperatively maximize an accumulated global joint reward function, and in our literature review a reference is considered an MARL reference as long as it employs RL to seek a policy for a group of agents to cooperatively optimize an accumulated global joint function in a dynamic environment): coupled dynamics (transition probability) [8, 9], partial observability [10, 11, 12], and coupled reward functions [13, 9, 14, 15]. Partial observability generally means that only incomplete information of the environment is observed by the learner; in many numerical experiments for MARL, e.g., [10, 11], partial observation refers to observing a subset of the agents and the environment, and in this work we consider the specific scenario where each agent only observes a subset of the agents, consistent with [12]. The aforementioned references either consider only one type of structure constraint, or employ only one graph to characterize different types of structure constraints. To give a specific example, when the MAS is a network of linear systems with dynamics couplings and the global objective is an accumulated quadratic function of the state and control policy, the problem of learning an optimal distributed policy becomes a data-driven distributed linear quadratic regulator (LQR) design problem [16, 17], which involves all the aforementioned structure constraints. Distributed RL algorithms, e.g., [18, 17, 19], have been proposed to deal with this problem; however, the three types of structure constraints are yet to be efficiently utilized. Moreover, the most popular formulation of MARL has long been the Markov game [20], of which the multi-agent LQR problem [16] is only a specific application scenario.

The scalability issue, as the second challenge, results from the high dimensions of the state and action spaces of MASs. Although each agent may represent an independent individual, the inter-agent couplings in the MARL problem make it difficult for each agent to learn its optimal policy w.r.t. the global objective with only local information. In the distributed RL literature [21, 13, 22, 12], the most common way of dealing with these couplings is to make agents exchange information with each other via a consensus algorithm, so that each agent is able to estimate the value of the global reward function even though it only has access to local information from neighbors. However, the performance of such distributed RL algorithms can be similar to or even worse than that of centralized RL algorithms because (i) consensus may take a long time to converge when the network is of large scale, and (ii) the learning process is essentially conducted via estimated global reward information. As a result, consensus-based distributed RL algorithms still suffer from significant scalability issues in terms of convergence rate and learning variance.

In this paper, we develop distributed scalable algorithms for a class of cooperative MARL problems in which each agent has its own state space and action space (similar to [8, 9]), and all three of the aforementioned types of couplings exist. We consider a general case where the inter-agent state transition couplings, state observation couplings, and reward couplings are characterized by three different graphs, namely, the state graph, observation graph, and reward graph, respectively. Based on these graphs, we derive a learning graph, which describes the required information flow during the RL process. The learning graph also provides guidance for constructing a local value function (LVF) for each agent, which is able to play the role of the global value function (GVF) in learning (note that the abbreviation "GVF" has been used to denote "general value function" in the literature, e.g., [23]; in this paper, "GVF" always means the global value function), but only involves a subset of the agents, and therefore can enhance the scalability of the learning algorithm.

When each agent has access to all the information involved in its LVF, distributed RL can be achieved by policy gradient algorithms immediately. However, this approach is usually based on interactions between many agents, which requires a dense communication graph. To further reduce the number of communication links, we design a distributed RL algorithm based on local consensus, whose computational complexity depends on the aforementioned three graphs (see Theorem 1). Compared with global consensus-based RL algorithms, local consensus algorithms usually have an improved convergence rate because the network scale is reduced (although the convergence rate of a consensus algorithm depends not only on the network scale but also on the communication weights, the convergence rate can typically be improved significantly if the network scale is greatly reduced; in [24], the relationship between the consensus convergence time and the number of nodes in the network is analyzed under a specific setting for the communication weights). This implies that the scalability of this RL algorithm requires specific conditions on the graphs embedded in the MARL problem. To relax the graphical conditions, we further introduce a truncation index and design a truncated LVF (TLVF), which involves fewer agents than the LVF. While being applicable to MARL with arbitrary graphs, the distributed RL algorithm based on TLVFs only generates an approximate solution, and the approximation error depends on the truncation index to be designed (see Theorem 2). We will show that there is a trade-off between minimizing the approximation error and reducing the computational complexity (enhancing the convergence rate).

In [25], we considered the case where no couplings exist among the rewards of different individual agents. In contrast, this paper considers coupled individual rewards, which further induce a reward graph. Moreover, this paper provides richer graphical results, a distributed RL algorithm via local consensus, and a TLVF-based distributed RL framework.

The main novel contributions of our work that add to the existing literature are summarized as follows.

(i). We consider a general formulation for distributed RL of networked dynamic agents, where the aforementioned three types of inter-agent couplings exist in the problem simultaneously. Similar settings have been considered in [8, 9]. The main novelty here is that the three coupling graphs in this paper are fully independent and inherently exist in the problem. Based on the three graphs corresponding to the three types of couplings, we derive a learning graph describing the information flow required in learning. By discussing the relationship between the learning graph and the three coupling graphs (see Lemma 3), one can clearly observe how different types of couplings affect the required information exchange in learning.

(ii). By employing the learning graph, we construct an LVF for each agent such that the partial gradient of the LVF w.r.t. each individual's policy parameter is exactly the same as that of the GVF (see Lemma 2), so that it can be directly employed in policy gradient algorithms. MARL algorithms based on LVFs have also been proposed via network-based LVF approximation [8, 26, 9] and value function decomposition [14, 15, 27, 28]. However, the network-based LVF approximation only provides an approximate solution. Moreover, the aforementioned value function decomposition references assume that all the agents share a common environment state and therefore never involve the state graph (which describes dynamics couplings between different agents).

(iii). To show the benefits of employing the constructed LVFs in policy gradient algorithms, we focus on zeroth-order optimization (ZOO). The ZOO method can be implemented with very limited information (only objective function evaluations) and therefore has a wide range of applications; in recent years, ZOO-based RL algorithms have been shown to be efficient in solving model-free optimal control problems, e.g., [29, 30, 17]. Inspired by these facts, we employ the ZOO-based method to deal with the model-free optimal distributed control problem of MASs under a very general formulation. Due to the removal of redundant information (fewer agents are involved) in gradient estimation, our LVF-based learning framework always exhibits a reduced gradient estimation variance (see Remark 3) compared with GVF-based policy evaluation. Note that most of the existing distributed ZOO algorithms [31, 32, 33, 34] essentially evaluate policies via the global value.

(iv). To deal with the scenario where the learning graph is dense, we construct TLVFs by further neglecting the couplings between agents that are far away from each other in the coupling graph. The underlying idea is motivated by [8, 26, 9]. Our design, however, is different in that those works construct a TLVF for each agent, whereas we design a TLVF for each cluster.

The rest of this paper is organized as follows. Section II describes the MARL formulation and the main goal of this paper. Section III introduces the LVF design and the learning graph derivation. Section IV shows the distributed RL algorithm based on LVFs and local consensus, and provides convergence analysis. Section V introduces the RL algorithm based on TLVFs as well as the convergence analysis. Section VII shows several simulation examples to illustrate the advantages of our proposed algorithms. Section VIII concludes the paper. Sections IX and X provide theoretical proofs and the relationships among different cluster-wise graphs, respectively.

Notation: Throughout the paper, unless otherwise stated, $\mathcal{G}_{X}=(\mathcal{V},\mathcal{E}_{X})$ always denotes an unweighted directed graph (we will introduce multiple graphs; here $X$ may represent $S$, $O$, $C$, $R$, and $L$), where $\mathcal{V}=\{1,...,N\}$ is the set of vertices, $\mathcal{E}_{X}\subset\mathcal{V}\times\mathcal{V}$ is the set of edges, and $(i,j)\in\mathcal{E}_{X}$ means that there is a directed edge in $\mathcal{G}_{X}$ from $i$ to $j$. The in-neighbor set and out-neighbor set of agent $i$ are denoted by $\mathcal{N}_{i}^{X}=\{j\in\mathcal{V}:(j,i)\in\mathcal{E}_{X}\}$ and $\mathcal{N}_{i}^{X+}=\{j\in\mathcal{V}:(i,j)\in\mathcal{E}_{X}\}$, respectively. A path from $i$ to $j$ is a sequence of distinct edges of the form $(i_{1},i_{2})$, $(i_{2},i_{3})$, ..., $(i_{r-1},i_{r})$ where $i_{1}=i$ and $i_{r}=j$. We use $i\stackrel{\mathcal{E}}{\longrightarrow}j$ to denote that there is a path from $i$ to $j$ in edge set $\mathcal{E}$. A subgraph $\mathcal{G}^{\prime}=(\mathcal{V}^{\prime},\mathcal{E}^{\prime})$ with $\mathcal{V}^{\prime}\subseteq\mathcal{V}$ and $\mathcal{E}^{\prime}\subseteq\mathcal{E}$ is said to be a strongly connected component (SCC) if there is a path between any two vertices in $\mathcal{G}^{\prime}$. A single vertex is a special SCC. Given a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, we define the transpose graph of $\mathcal{G}$ as $\mathcal{G}^{\top}=(\mathcal{V},\mathcal{E}^{\top})$, where $\mathcal{E}^{\top}=\{(i,j)\in\mathcal{V}\times\mathcal{V}:(j,i)\in\mathcal{E}\}$. Given two edge sets $\mathcal{E}_{1}$ and $\mathcal{E}_{2}$, it can be verified that $\mathcal{E}_{1}\subseteq\mathcal{E}_{2}$ if and only if $\mathcal{E}_{1}^{\top}\subseteq\mathcal{E}_{2}^{\top}$. Moreover, if $(i,j),(j,i)\in\mathcal{E}$, then $(i,j),(j,i)\in\mathcal{E}^{\top}$. Given a set $A$ and a vector $v$, $v_{A}=(...,v_{i},...)^{\top}$ with $i\in A$. Given a set $X$, $\mathcal{P}(X)$ is the set of probability distributions over $X$. The $d\times d$ identity matrix is denoted by $I_{d}$, and the $a\times b$ zero matrix is denoted by $\mathbf{0}_{a\times b}$. $\mathbb{R}^{d}$ is the $d$-dimensional Euclidean space. $\mathbb{N}$ is the set of non-negative integers.

II Multi-Agent Reinforcement Learning

Consider the optimal control problem of a MAS modeled by a Markov decision process (MDP), described as a tuple $\mathcal{M}=(\mathcal{G}_{\{S,O,R,C\}},\Pi_{i\in\mathcal{V}}\mathcal{T}_{i},\Pi_{i\in\mathcal{V}}\mathcal{O}_{i},\Pi_{i\in\mathcal{V}}r_{i},\gamma)$, with $\mathcal{G}_{\{S,O,R,C\}}=(\mathcal{V},\mathcal{E}_{\{S,O,R,C\}})$ describing the different interaction graphs and $\mathcal{T}_{i}=(\mathcal{S}_{i},\mathcal{A}_{i},\mathcal{P}_{i})$ specifying the evolution process of agent $i$. (The evolution process of each agent depends on other agents and therefore does not by itself possess the Markov property; however, the whole MAS has the Markov property, as its full state depends only on the state and action at the last step and is independent of earlier states and actions.) Detailed notation is listed below:

  • $\mathcal{V}=\{1,...,N\}$ is the set of agent indices;

  • $\mathcal{E}_{S}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set of the state graph $\mathcal{G}_{S}=(\mathcal{V},\mathcal{E}_{S})$, which specifies the dynamics couplings among the agents' states; $(i,j)\in\mathcal{E}_{S}$ implies that the state evolution of agent $j$ involves the state of agent $i$;

  • $\mathcal{E}_{O}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set of the observation graph $\mathcal{G}_{O}=(\mathcal{V},\mathcal{E}_{O})$, which determines the partial observation of each agent. More specifically, agent $i$ observes the state of agent $j$ if $(j,i)\in\mathcal{E}_{O}$;

  • $\mathcal{E}_{R}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set of the reward graph $\mathcal{G}_{R}=(\mathcal{V},\mathcal{E}_{R})$, which describes the inter-agent couplings in the reward of each individual agent; the reward of agent $i$ involves the state and the action of agent $j$ if $(j,i)\in\mathcal{E}_{R}$;

  • $\mathcal{E}_{C}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set of the communication graph $\mathcal{G}_{C}=(\mathcal{V},\mathcal{E}_{C})$. An edge $(i,j)\in\mathcal{E}_{C}$ implies that agent $j$ is able to receive information from agent $i$;

  • $\mathcal{S}_{i}$ and $\mathcal{A}_{i}$ are the state space and the action space of agent $i$, respectively, and can be either continuous or finite;

  • $\mathcal{P}_{i}:\Pi_{j\in\mathcal{I}_{i}^{S}}\mathcal{S}_{j}\times\Pi_{j\in\mathcal{I}_{i}^{S}}\mathcal{A}_{j}\rightarrow\mathcal{P}(\mathcal{S}_{i})$ is the transition probability function specifying the state probability distribution of agent $i$ at the next time step under the current states $\{s_{j}\}_{j\in\mathcal{I}_{i}^{S}}$ and actions $\{a_{j}\}_{j\in\mathcal{I}_{i}^{S}}$, where $\mathcal{I}_{i}^{S}=\{j\in\mathcal{V}:(j,i)\in\mathcal{E}_{S}\}\cup\{i\}$ includes agent $i$ and its in-neighbors in the state graph $\mathcal{G}_{S}$;

  • $r_{i}:\Pi_{j\in\mathcal{I}_{i}^{R}}\mathcal{S}_{j}\times\Pi_{j\in\mathcal{I}_{i}^{R}}\mathcal{A}_{j}\rightarrow\mathbb{R}$ is the immediate reward returned to agent $i$ when each agent $j\in\mathcal{I}^{R}_{i}$ takes action $a_{j}\in\mathcal{A}_{j}$ at the current state $s_{j}\in\mathcal{S}_{j}$, where $\mathcal{I}_{i}^{R}=\{j\in\mathcal{V}:(j,i)\in\mathcal{E}_{R}\}\cup\{i\}$;

  • $\mathcal{O}_{i}=\Pi_{j\in\mathcal{I}_{i}^{O}}\mathcal{S}_{j}$ is the observation space of agent $i$, which includes the states of all the agents in $\mathcal{I}_{i}^{O}$, where $\mathcal{I}_{i}^{O}=\{j\in\mathcal{V}:(j,i)\in\mathcal{E}_{O}\}\cup\{i\}$;

  • $\gamma\in(0,1)$ is the discount factor that trades off instantaneous and future rewards.

Let $\mathcal{S}=\Pi_{j\in\mathcal{V}}\mathcal{S}_{j}$, $\mathcal{A}=\Pi_{j\in\mathcal{V}}\mathcal{A}_{j}$, and $\mathcal{P}=\Pi_{j\in\mathcal{V}}\mathcal{P}_{j}$ denote the joint state space, action space, and transition probability function of the whole MAS. Each agent $i$ has a state $s_{i}\in\mathcal{S}_{i}$ and an action $a_{i}\in\mathcal{A}_{i}$. The global state and action at time step $t$ are denoted by $s(t)=(s_{1}(t),...,s_{N}(t))$ and $a(t)=(a_{1}(t),...,a_{N}(t))$, respectively. Let $\pi:\mathcal{S}\rightarrow\mathcal{P}(\mathcal{A})$ and $\pi_{i}:\mathcal{O}_{i}\rightarrow\mathcal{P}(\mathcal{A}_{i})$ be a global policy function of the MAS and a local policy function of agent $i$, respectively, where $\mathcal{P}(\mathcal{A})$ and $\mathcal{P}(\mathcal{A}_{i})$ are the sets of probability distributions over $\mathcal{A}$ and $\mathcal{A}_{i}$. The global policy is the policy of the whole MAS from the centralized perspective and is thus based on the global state $s$. The local policy of agent $i$ is based on the local observation $o_{i}=s_{\mathcal{I}_{i}^{O}}\in\mathcal{O}_{i}$, which consists of the states of a subset of the agents. Note that a global policy always corresponds uniquely to a collection of local policies.

At each time step $t$ of the MDP, each agent $i\in\mathcal{V}$ executes an action $a_{i}(t)\in\mathcal{A}_{i}$ according to its policy $\pi_{i}$ and its local observation $o_{i}(t)=s_{\mathcal{I}_{i}^{O}}(t)\in\mathcal{O}_{i}$, and then obtains a reward $r_{i}(s_{\mathcal{I}_{i}^{R}}(t),a_{\mathcal{I}_{i}^{R}}(t))$. Note that this formulation is different from that in [13, 35], where the transition and reward of each agent are associated with the global state $s(t)$. Moreover, different from many MARL references where the reward of each agent only depends on its own state and action, we consider a more general cooperative formulation in which the reward of each agent may be influenced by other agents, as determined by the reward graph $\mathcal{G}_{R}=(\mathcal{V},\mathcal{E}_{R})$.

The long-term accumulated discounted global reward is defined as

R=t=0γtr(s(t),a(t))=i=1Nt=0γtri(siR(t),aiR(t)),R=\sum_{t=0}^{\infty}\gamma^{t}r(s(t),a(t))=\sum_{i=1}^{N}\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{\mathcal{I}_{i}^{R}}(t),a_{\mathcal{I}_{i}^{R}}(t)), (1)

where $r(s(t),a(t))$ is the global reward of the MAS at time $t$ and $r_{i}$ is the local reward of agent $i$ at time $t$. Note that maximizing $R$ is equivalent to maximizing its average $\frac{1}{N}R$, which has been commonly adopted as the learning objective in many MARL references, e.g., [21, 13, 12]. Based on this long-term global reward, for a given policy $\pi$, we define the global state value function $V^{\pi}(s)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r(s(t),a(t))|s(0)=s]$ and the global state-action value function $Q^{\pi}(s,a)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r(s(t),a(t))|s(0)=s,a(0)=a]$, which describe the expected long-term global reward when the agents start from the initial state $s$ and the initial state-action pair $(s,a)$, respectively. Similarly, the local state value function of each agent $i$ with initial state $s$ is defined as $V_{i}^{\pi}(s)=\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{\mathcal{I}_{i}^{R}}(t),a_{\mathcal{I}_{i}^{R}}(t))|s(0)=s]$.

The goal of this paper is to design a distributed RL algorithm for the MAS to find a control policy $\pi$ maximizing $J(\pi)=\mathbb{E}_{s\sim\mathcal{D}}V^{\pi}(s)$, whose expression is

$\mathbb{E}_{s\sim\mathcal{D}}\left[\sum_{i=1}^{N}\sum_{t=0}^{\infty}\mathbb{E}_{a(t)\sim\pi(s(t))}\gamma^{t}r_{i}(s_{\mathcal{I}_{i}^{R}}(t),a_{\mathcal{I}_{i}^{R}}(t))\,\Big|\,s(0)=s\right],$ (2)

where $\mathcal{D}$ denotes the distribution of the initial state. For convenience of analysis, we also define, for each agent $i$, the expected value associated with its individual reward:

$J_{i}(\pi)=\mathbb{E}_{s\sim\mathcal{D}}V_{i}^{\pi}(s),~~i\in\mathcal{V}.$ (3)

Note that $J_{i}$ may be determined by the policies of only a subset of the agents, instead of the global policy $\pi$. However, the global policy $\pi$ always determines $J_{i}$ and can therefore be employed as the argument of $J_{i}$.

We parameterize the global policy $\pi(s,a)$ by parameters $\theta=(\theta_{1}^{\top},...,\theta_{N}^{\top})^{\top}\in\mathbb{R}^{d}$ with $\theta_{i}\in\mathbb{R}^{d_{i}}$. The global policy and agent $i$'s local policy are then written as $\pi^{\theta}(s,a)$ and $\pi^{\theta_{i}}_{i}(o_{i},a_{i})$, respectively. Note that, given any global state $s\in\mathcal{S}$, a global policy and a collection of local policies can always be transformed into each other. We now turn to solving the following optimization problem:

$\max_{\theta}J(\theta):=\mathbb{E}_{s\sim\mathcal{D}}V^{\pi(\theta)}(s).$ (4)

Next we present a distributed multi-warehouse resource transfer problem to demonstrate our formulation. This example is a variation of many practical applications, e.g., it can also be read as a problem of energy transfer among different rooms in a smart building.

Example 1

Consider a network of 9 warehouses $\mathcal{V}=\{1,...,9\}$ consuming resources while transferring resources among each other. The goal is to guarantee adequate supplies for each warehouse. Each warehouse is denoted by a vertex in the graph. The state graph $\mathcal{G}_{S}=(\mathcal{V},\mathcal{E}_{S})$ describing the transition relationship is shown in Fig. 1. The observation graph $\mathcal{G}_{O}$ only contains 3 edges involving the 3 leaf nodes of $\mathcal{G}_{S}$, as shown in Fig. 1, which implies that only warehouses 2, 3, and 5 observe the current resource stock of a warehouse other than themselves. The motivation behind this setting is that warehouses 1, 4, and 6 do not send out resources at all; hence their neighbors need to keep monitoring their states so that the resources sent to them are neither insufficient nor redundant. The reward graph $\mathcal{G}_{R}$ is shown in Fig. 2 and contains $\mathcal{G}_{O}$ as a subgraph. This ensures that the observation of each warehouse always influences its own reward, implying that a warehouse is responsible for the resource shortage of those warehouses it can observe. At time step $t$, warehouse $i\in\mathcal{V}$ stores resources of the amount $m_{i}(t)\in\mathbb{R}$, receives a local demand $d_{i}(t)\in\mathbb{R}$, and sends part of its resources to and receives resources from its neighbors in the state graph $\mathcal{G}_{S}$; besides its neighbors, warehouse $i$ also receives a resource supply of the amount $y_{i}(t)$ from outside. Let $z_{i}(t)=y_{i}(t)-d_{i}(t)$; then agent $i$ has the following dynamics

$m_{i}(t+1)=m_{i}(t)-\sum_{j\in\mathcal{N}_{i}^{S+}}b_{ij}(o_{i}(t))\alpha_{i}m_{i}(t)+\sum_{j\in\mathcal{N}_{i}^{S}}b_{ji}(o_{j}(t))\alpha_{j}m_{j}(t)+z_{i}(t),$
$z_{i}(t)=A_{i}\sin(w_{i}t+\phi_{i})+\omega_{i},$ (5)

where $b_{ij}(o_{i}(t))\in[0,1]$ denotes the fraction of resources agent $i$ sends to its neighbor $j$ at time $t$; $\alpha_{i}$ indicates whether the $i$-th warehouse has resources to send out, i.e., $\alpha_{i}=0$ if $m_{i}\leq 0$ and $\alpha_{i}=1$ otherwise; $0<A_{i}<m_{i}(0)$ is a constant; $w_{i}$ is a bounded random quantity and $\phi_{i}$ is a positive scalar, $i\in\mathcal{V}$; and $\mathcal{N}_{i}^{S}=\{j\in\mathcal{V}:(j,i)\in\mathcal{E}_{S}\}$ and $\mathcal{N}_{i}^{S+}=\{j\in\mathcal{V}:(i,j)\in\mathcal{E}_{S}\}$ are the in-neighbor set and the out-neighbor set of agent $i$ in $\mathcal{G}_{S}$, respectively.

From the MARL perspective, besides the three graphs and the transition dynamics introduced above, the remaining entries of $\mathcal{M}$ for each agent $i$ at time step $t$ can be recognized as follows. Individual state: $s_{i}(t)=(m_{i}(t),z_{i}(t))^{\top}$. Individual action: $a_{i}(t)=(...,b_{ij}(o_{i}(t)),...)^{\top}_{j\in\mathcal{N}_{i}^{S+}}$. Individual policy function: $\pi_{i}(\cdot)=(...,b_{ij}(\cdot),...)^{\top}_{j\in\mathcal{N}_{i}^{S+}}$. Partial observation: $o_{i}(t)=(\{m_{j}(t)\}_{j\in\mathcal{I}_{i}^{O}},z_{i}(t))^{\top}\in\mathbb{R}^{|\mathcal{I}_{i}^{O}|+1}$. Individual reward: $r_{i}(t)=\sum_{j\in\mathcal{I}_{i}^{R}}\tau_{j}(t)$, where $\tau_{j}(t)=0$ if $m_{j}(t)\geq 0$, and $\tau_{j}(t)=-m_{j}^{2}(t)$ otherwise.

The goal of the resource transfer problem is to maximize $\mathbb{E}_{s(0)\sim\mathcal{D}}\sum_{i=1}^{N}\sum_{t=0}^{\infty}\gamma^{t}r_{i}(t)$ under the dynamics constraint (5). In other words, we aim to find the optimal transfer policy such that each warehouse keeps having enough resources for use.
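To make the dynamics (5) and the individual reward concrete, the following minimal Python sketch rolls out a small warehouse chain under a fixed transfer policy. The topology, the equal-split policy, and the noise model are illustrative assumptions, not the settings used later in the simulations.

```python
import numpy as np

# Hypothetical 4-warehouse chain: edge (i, j) in the state graph means i can send resources to j.
N = 4
out_neighbors = {0: [1], 1: [2], 2: [3], 3: []}
in_neighbors = {j: [i for i in range(N) if j in out_neighbors[i]] for j in range(N)}
reward_in = {i: [i] + in_neighbors[i] for i in range(N)}    # assumed reward sets I_i^R

rng = np.random.default_rng(0)
m = rng.uniform(5.0, 10.0, size=N)                          # initial stocks m_i(0)
A, w = 1.0, 0.3                                             # assumed amplitude/frequency of z_i(t)
phi = rng.uniform(0.0, np.pi, size=N)

def policy(i, m):
    """Hypothetical fixed policy: split an equal fraction among out-neighbors."""
    k = len(out_neighbors[i])
    return {j: 1.0 / (k + 1) for j in out_neighbors[i]}

def step(m, t):
    """One transition of the dynamics (5)."""
    alpha = (m > 0).astype(float)                             # alpha_i = 1 iff warehouse i has stock
    z = A * np.sin(w * t + phi) + rng.uniform(-0.1, 0.1, N)   # z_i(t) with a bounded random term
    m_next = m.copy()
    for i in range(N):
        sent = sum(policy(i, m).values()) * alpha[i] * m[i]
        received = sum(policy(j, m)[i] * alpha[j] * m[j] for j in in_neighbors[i])
        m_next[i] += -sent + received + z[i]
    return m_next

def reward(i, m):
    """r_i(t) = sum of tau_j over I_i^R, with tau_j = -m_j^2 when warehouse j runs short."""
    return sum(0.0 if m[j] >= 0 else -m[j] ** 2 for j in reward_in[i])

gamma, R = 0.95, 0.0
for t in range(50):
    R += gamma ** t * sum(reward(i, m) for i in range(N))
    m = step(m, t)
print("discounted global reward of this rollout:", R)
```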

Remark 1

Note that many settings in this example can be adjusted while maintaining the applicability of the approach proposed in this paper. For example, the partial observation of each agent $i$ can be replaced by $o_{i}=(...,m_{j}(t),d_{j}(t),...)^{\top}_{j\in\mathcal{I}_{i}^{O}}$ or $o_{i}=(...,m_{j}(t),...)^{\top}_{j\in\mathcal{I}_{i}^{O}}$. Depending on the observation setting, the optimal policy may change.

Figure 1: The state graph $\mathcal{G}_{S}=(\mathcal{V},\mathcal{E}_{S})$ and the observation graph $\mathcal{G}_{O}=(\mathcal{V},\mathcal{E}_{O})$. The black and red lines correspond to edges in $\mathcal{E}_{S}$ and $\mathcal{E}_{O}$, respectively.
Figure 2: The reward graph $\mathcal{G}_{R}=(\mathcal{V},\mathcal{E}_{R})$.

Existing distributed policy gradient methods such as actor-critic [13] and zeroth-order optimization [12] can be employed to solve this problem when there is a connected undirected communication graph among the agents. However, these approaches are based on estimation of the GVF, which requires a large amount of communication during each learning episode. Moreover, policy evaluation based on the GVF suffers from a significant scalability issue due to the high dimension of the state and action spaces of large-scale networks.

III Local Value Function and Learning Graph

In this section, we introduce how to design an appropriate LVF for each agent, which involves only a subset of the agents but whose gradient w.r.t. the local policy parameter is the same as that of the GVF.

III-A Local Value Function Design

Although the state graph $\mathcal{G}_{S}$, the observation graph $\mathcal{G}_{O}$, and the reward graph $\mathcal{G}_{R}$ can be defined independently, all of them induce couplings between agents in the optimization objective. In this subsection, we build the connection between these graphs and the inter-agent couplings, based on which the LVFs are designed.

Define a new graph $\mathcal{G}_{SO}\triangleq(\mathcal{V},\mathcal{E}_{SO})$ where $\mathcal{E}_{SO}=\mathcal{E}_{S}\cup\mathcal{E}_{O}$, and define

$\mathcal{R}^{SO}_{i}=\{j\in\mathcal{V}:i\stackrel{\mathcal{E}_{SO}}{\longrightarrow}j\}\cup\{i\},$ (6)

which includes the vertices in graph $\mathcal{G}_{SO}$ that are reachable from vertex $i$, together with vertex $i$ itself. In fact, the states of the agents in $\mathcal{R}_{i}^{SO}$ will be affected by agent $i$'s state and action as time goes on.

To design the LVF for agent $i$, we need to specify the agents whose individual rewards will be affected by the action of agent $i$. To this end, we define the following composite reward for agent $i$:

$\hat{r}_{i}(s(t),a(t))=\sum_{j\in\mathcal{I}_{i}^{L}}r_{j}(s_{\mathcal{I}_{j}^{R}}(t),a_{\mathcal{I}_{j}^{R}}(t)),$ (7)

where

$\mathcal{I}_{i}^{L}=\{j\in\mathcal{V}:\mathcal{I}_{j}^{R}\cap\mathcal{R}_{i}^{SO}\neq\varnothing\}=\cup_{k\in\mathcal{R}_{i}^{SO}}\mathcal{I}_{k}^{R+},$ (8)

and $\mathcal{I}_{k}^{R+}=\{j\in\mathcal{V}:(k,j)\in\mathcal{E}_{R}\}\cup\{k\}$ consists of the out-neighbors of vertex $k$ in graph $\mathcal{G}_{R}$ together with $k$ itself.

To illustrate the definitions of $\mathcal{R}_{i}^{SO}$ and $\mathcal{I}_{i}^{L}$, let us revisit Example 1. One can observe from Fig. 1 and Fig. 2 that $\mathcal{R}_{1}^{SO}=\mathcal{R}_{2}^{SO}=\mathcal{I}_{1}^{L}=\mathcal{I}_{2}^{L}=\{1,2,3,4\}$. In fact, we have $\mathcal{R}_{i}^{SO}=\mathcal{I}_{i}^{L}$ since $\mathcal{I}_{k}^{R+}\subset\mathcal{R}_{i}^{SO}$ for all $k\in\mathcal{R}_{i}^{SO}$, $i\in\mathcal{V}$.
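The sets in (6) and (8) can be computed by plain graph reachability. The sketch below (with illustrative edge lists that only mimic the structure of Fig. 1 and Fig. 2) returns $\mathcal{R}_{i}^{SO}$ as the vertices reachable from $i$ in $\mathcal{G}_{SO}$ plus $i$ itself, and $\mathcal{I}_{i}^{L}$ as the union of the out-neighborhoods $\mathcal{I}_{k}^{R+}$ over $k\in\mathcal{R}_{i}^{SO}$.

```python
def reachable_from(i, edges):
    """R_i^SO in (6): vertices reachable from i through the given directed edges, plus i."""
    adj = {}
    for (u, v) in edges:
        adj.setdefault(u, []).append(v)
    visited, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in visited:
                visited.add(v)
                stack.append(v)
    return visited

def learning_set(i, E_SO, E_R):
    """I_i^L in (8): union of the out-neighborhoods I_k^{R+} in G_R over k in R_i^SO."""
    I_L = set()
    for k in reachable_from(i, E_SO):
        I_L.add(k)                                   # k itself belongs to I_k^{R+}
        I_L.update(j for (u, j) in E_R if u == k)    # out-neighbors of k in G_R
    return I_L

# Illustrative (assumed) edge sets, not the exact graphs of Fig. 1 and Fig. 2.
E_SO = [(2, 1), (2, 3), (3, 4), (5, 6)]
E_R = [(1, 2), (3, 2)]
print(learning_set(2, E_SO, E_R))    # -> {1, 2, 3, 4}
```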

Accordingly, we define the LVF for agent $i$ as

$\hat{V}^{\pi}_{i}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\hat{r}_{i}(s(t),a(t))\,\big|\,s(0)=s\right].$ (9)

When the GVF is replaced by the LVF, agent $i$ is expected to maximize the following objective:

$\hat{J}_{i}(\theta)=\mathbb{E}_{s\sim\mathcal{D}}\hat{V}^{\pi(\theta)}_{i}(s)=\sum_{j\in\mathcal{I}^{L}_{i}}J_{j}(\theta).$ (10)

Different from the global objective function $J(\theta)=\sum_{j\in\mathcal{V}}J_{j}(\theta)$, the local objective $\hat{J}_{i}(\theta)$ only involves the agents in a subset $\mathcal{I}_{i}^{L}\subseteq\mathcal{V}$. We make the following assumption on the graphs so that $\hat{J}_{i}(\theta)\neq J(\theta)$ for at least one agent $i$.

Assumption 1

There exists a vertex $i\in\mathcal{V}$ such that $\mathcal{I}_{i}^{L}\neq\mathcal{V}$.

Define the graph $\mathcal{G}_{SOR}=\mathcal{G}_{SO}\cup\mathcal{G}_{R}$. The following lemma gives a sufficient graphical condition and a necessary graphical condition for Assumption 1.

Lemma 1

The following statements are true:

(i). Assumption 1 holds if graph $\mathcal{G}_{SOR}$ has $n>1$ SCCs.

(ii). Assumption 1 holds only if graph $\mathcal{G}_{SO}$ has $n>1$ SCCs.

One may ask whether the converses of the statements in Lemma 1 are true. The answer to both is no. This is because graph $\mathcal{G}_{R}$ may contain edges that connect different SCCs in $\mathcal{G}_{SO}$, but paths involving more than two vertices in $\mathcal{G}_{R}$ cannot be used in expanding $\mathcal{I}_{i}^{L}$. For statement (i), $\mathcal{G}_{SOR}$ may be strongly connected even when there exists a vertex $j\in\mathcal{V}\setminus\mathcal{I}_{i}^{L}$. Fig. 3 shows a counterexample where $\mathcal{I}_{1}^{L}=\{1,2,4\}$ is only a subset of $\mathcal{V}$ but $\mathcal{G}_{SOR}$ is strongly connected. For statement (ii), a simple counterexample is obtained by setting $\mathcal{E}_{R}=\mathcal{V}\times\mathcal{V}$. Note that Lemma 1 yields a necessary and sufficient condition for Assumption 1 when $\mathcal{G}_{SO}=\mathcal{G}_{SOR}$, which happens when $\mathcal{G}_{R}\subseteq\mathcal{G}_{SO}$.

Figure 3: A counterexample for the converse of Lemma 1 (i). Graphs (a), (b), (c), and (d) denote $\mathcal{G}_{SO}$, $\mathcal{G}_{R}$, $\mathcal{G}_{SOR}$, and the learning graph $\mathcal{G}_{L}$, respectively.

Next we show that, for each agent $i\in\mathcal{V}$, maximizing (10) is equivalent to maximizing the global objective (4).

Given a function $f(\theta):\mathbb{R}^{d}\rightarrow\mathbb{R}$ and a positive $\delta$, we define

$f^{\delta}(\theta)=\mathbb{E}[f(\theta+\delta u)],~~u\sim\mathcal{N}(0,I_{d}).$ (11)

The following lemma shows the equivalence between the gradients of the smoothed local objective $\hat{J}_{i}^{\delta}(\theta)$ and the smoothed global objective $J^{\delta}(\theta)$ w.r.t. the local policy parameter of each individual agent.

Lemma 2

The following statements are true:

(i) $\nabla_{\theta_{i}}J^{\delta}(\theta)=\nabla_{\theta_{i}}\hat{J}_{i}^{\delta}(\theta)$ for any $\delta>0$, $i\in\mathcal{V}$.

(ii) If $J_{i}(\theta)$, $i\in\mathcal{V}$, are differentiable, then $\nabla_{\theta_{i}}J(\theta)=\nabla_{\theta_{i}}\hat{J}_{i}(\theta)$, $i\in\mathcal{V}$.

Lemma 2 reveals the implicit connection between the graphs and the agents' couplings in the optimization objective, and provides the theoretical justification for the designed LVFs. Two notes are in order. (a) Although the RL algorithm in this paper is based on ZOO, Lemma 2 is independent of ZOO; therefore, Lemma 2 is also compatible with other policy gradient algorithms. (b) Statement (i) in Lemma 2 does not require $J_{i}(\theta)$, $i\in\mathcal{V}$, to be differentiable because $J^{\delta}(\theta)=\mathbb{E}[J(\theta+\delta u)]$ is always differentiable [36, Lemma 2].

In order to adapt our approach to the scenario where $J(\theta)$ is not differentiable, we choose to find a stationary point of $J^{\delta}(\theta)$. The gap between $J(\theta)$ and $J^{\delta}(\theta)$ can be bounded if $J(\theta)$ is Lipschitz continuous and $\delta>0$ is sufficiently small.

To guarantee the Lipschitz continuity of $J(\theta)$ (the Lipschitz continuity of a value function implies that similar policy parameters yield similar performance, which is reasonable in practice, especially for problems with continuous state and action spaces; in [37], it has been shown that the value function is Lipschitz continuous w.r.t. the policy parameters as long as both the MDP and the policy function have Lipschitz continuity properties), we make the following assumption on the functions $V_{i}^{\pi(\theta)}(s)$ for $i\in\mathcal{V}$:

Assumption 2

$V_{i}^{\pi(\theta)}(s)$, $i\in\mathcal{V}$, are $L_{i}$-Lipschitz continuous w.r.t. $\theta$ in $\mathbb{R}^{d}$ for any $s\in\mathcal{S}$. That is, $|V_{i}^{\pi(\theta)}(s)-V_{i}^{\pi(\theta^{\prime})}(s)|\leq L_{i}\|\theta-\theta^{\prime}\|$ for any $s\in\mathcal{S}$ and $\theta,\theta^{\prime}\in\mathbb{R}^{d}$.

Assumption 2 directly implies that $J_{i}(\theta)$ is $L_{i}$-Lipschitz continuous. Moreover, $J(\theta)$ is $L$-Lipschitz continuous in $\mathbb{R}^{d}$, where $L\triangleq\sum_{i\in\mathcal{V}}L_{i}$, due to the following fact:

$|J(\theta)-J(\theta^{\prime})|\leq\sum_{i\in\mathcal{V}}|J_{i}(\theta)-J_{i}(\theta^{\prime})|\leq\sum_{i\in\mathcal{V}}\mathbb{E}\left[\left|V_{i}^{\pi(\theta)}(s)-V_{i}^{\pi(\theta^{\prime})}(s)\right|\right]\leq\sum_{i\in\mathcal{V}}L_{i}\|\theta-\theta^{\prime}\|.$ (12)

III-B Learning Graph

Lemma 2 shows that having the local gradient of the corresponding local objective is sufficient for each agent to optimize its policy according to the following gradient ascent:

$\theta_{i}^{k+1}=\theta_{i}^{k}+\nabla_{\theta_{i}}\hat{J}_{i}^{\delta}(\theta^{k}),~~~~i\in\mathcal{V},$ (13)

where $\theta^{k}$ is the policy parameter at step $k$, and $\nabla_{\theta_{i}}\hat{J}_{i}^{\delta}(\theta^{k})$ can be estimated by evaluating the value of $\hat{J}_{i}^{\delta}(\theta^{k})$.
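To illustrate how (13) can be realized with only value evaluations, the following sketch performs a single one-point zeroth-order ascent step on the smoothed local objective; `evaluate_local_value` is a hypothetical stand-in for rolling out the perturbed policies and summing the discounted rewards of the agents in $\mathcal{I}_{i}^{L}$.

```python
import numpy as np

def zo_ascent_step(theta_i, evaluate_local_value, delta=0.05, eta=1e-3, rng=None):
    """One one-point zeroth-order ascent step on the smoothed local objective (sketch).

    evaluate_local_value(theta_i_perturbed) is assumed to return a (noisy) sample of
    hat{J}_i obtained by rolling out the joint policy with agent i's parameter perturbed;
    the perturbations of the other agents are drawn and applied by those agents.
    """
    rng = rng or np.random.default_rng()
    u_i = rng.standard_normal(theta_i.shape)             # u_i ~ N(0, I_{d_i})
    value = evaluate_local_value(theta_i + delta * u_i)  # noisy evaluation of hat{J}_i
    g_i = (value / delta) * u_i                          # one-point estimate of grad_{theta_i} hat{J}_i^delta
    return theta_i + eta * g_i                           # gradient ascent as in (13)
```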

We are then able to define the learning graph $\mathcal{G}_{L}=(\mathcal{V},\mathcal{E}_{L})$ based on the set $\mathcal{I}_{i}^{L}$ in (8) used in the LVF design; it describes the required reward information flow during the learning process. The edge set $\mathcal{E}_{L}$ is defined as

$\mathcal{E}_{L}=\{(j,i)\in\mathcal{V}\times\mathcal{V}:j\in\mathcal{I}_{i}^{L},i\in\mathcal{V}\}.$ (14)

The definition of $\mathcal{E}_{L}$ implies the following result.

Lemma 3

If $(j,i)\in\mathcal{E}_{L}$, then $i\stackrel{\mathcal{E}_{SOR}}{\longrightarrow}j$.

The converse of Lemma 3 is not true; see Fig. 3 for a counterexample. More specifically, $1\stackrel{\mathcal{E}_{SOR}}{\longrightarrow}3$, as shown in graph (c), but $(3,1)\notin\mathcal{E}_{L}$, as seen in the learning graph (d).

To better understand the learning graph $\mathcal{G}_{L}$, we find a clustering $\mathcal{V}=\cup_{l=1}^{n}\mathcal{V}_{l}$ of the graph $\mathcal{G}_{SO}$, where $\mathcal{V}_{l}$ is the vertex set of the $l$-th maximal SCC in $\mathcal{G}_{SO}$, and $\mathcal{V}_{l_{1}}\cap\mathcal{V}_{l_{2}}=\varnothing$ for any distinct $l_{1},l_{2}\in\mathcal{C}=\{1,...,n\}$. According to Lemma 1, such a clustering with $n>1$ can always be found under Assumption 1.

According to the definitions (8) and (14), we have the following observations:

  • The agents in each cluster $\mathcal{V}_{l}$ form a clique in $\mathcal{G}_{L}$.

  • The agents in the same cluster share the same LVF.

The first observation holds because any pair of agents in a cluster are reachable from each other and $(j,i)\in\mathcal{E}_{L}$ as long as $j$ is reachable from $i$ in graph $\mathcal{G}_{SO}$. The second observation holds because $\mathcal{R}_{i}^{SO}=\mathcal{R}_{j}^{SO}$ for any $i$ and $j$ belonging to the same cluster, where $\mathcal{R}_{i}^{SO}$ is defined in (6).

To demonstrate the edge set definition (14), the learning graph corresponding to the state graph and the observation graph in Fig. 1, and the reward graph in Fig. 2, is shown in Fig. 4. In fact, it is interesting to see some connections between different graphs from the cluster-wise perspective. Please refer to Appendix B for more details.
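To connect the clustering with the graphs computationally, the following sketch (using the networkx library and an assumed $\mathcal{G}_{SO}$) extracts the maximal SCCs of $\mathcal{G}_{SO}$ and checks the second observation above: agents in the same cluster have identical reachable sets $\mathcal{R}_{i}^{SO}$ and hence share the same LVF.

```python
import networkx as nx

# Assumed G_SO for illustration: two 2-cycles {1,2} and {3,4} connected by the edge (2, 3).
G_SO = nx.DiGraph([(1, 2), (2, 1), (2, 3), (3, 4), (4, 3)])

clusters = [set(c) for c in nx.strongly_connected_components(G_SO)]   # V_1, ..., V_n

def reach(i):
    """R_i^SO in (6): descendants of i in G_SO plus i itself."""
    return set(nx.descendants(G_SO, i)) | {i}

# Agents in the same maximal SCC have the same reachable set, hence share the same LVF.
for cluster in clusters:
    reach_sets = {frozenset(reach(i)) for i in cluster}
    print(sorted(cluster), "reachable:", sorted(next(iter(reach_sets))),
          "shared:", len(reach_sets) == 1)
```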

Figure 4: The learning graph $\mathcal{G}_{L}$ corresponding to $\mathcal{G}_{SO}$ in Fig. 1 and $\mathcal{G}_{R}$ in Fig. 2.

The learning graph $\mathcal{G}_{L}$ describes the reward information flow required by our distributed MARL algorithm. If the agents are able to exchange information via communication links following $\mathcal{G}_{L}$, then each agent can acquire the information of its LVF via local communication with others, and the zeroth-order oracle in [36] can be employed to estimate $\nabla_{\theta_{i}}\hat{J}_{i}^{\delta}(\theta^{k})$ in (13). However, $\mathcal{G}_{L}$ is usually dense, which induces high communication costs, and having such a dense communication graph may be unrealistic in practice. To further relax the condition on the communication graph, in the next section we design a distributed RL algorithm based on local consensus.

IV Distributed RL Based on Local Consensus

In this section, we propose a distributed RL algorithm based on local consensus and ZOO with policy search in the parameter space. ZOO-based RL with policy search in the action space has been proposed in [38]. Compared to the action space, the parameter space usually has a higher dimension; however, the approach in [38] requires the action space to be continuous and leverages the Jacobian of the policy $\pi$ w.r.t. $\theta$. Our RL algorithm is applicable to both continuous and discrete action spaces and does not even require $\pi$ to be differentiable. In addition, our distributed learning framework based on LVFs is compatible with policy search in the action space.

IV-A Communication Weights and Distributed RL Design

We have shown that agents in the same maximal SCC share the same LVF. Therefore, there are $n$ LVFs to be estimated, where $n$ is the number of maximal SCCs in $\mathcal{G}_{SO}$. Moreover, it is unnecessary for an agent to estimate an LVF that is independent of this agent. For notational simplicity, we use $\mathcal{I}_{l}^{cl}$ to denote the index set of agents involved in the LVF of the $l$-th cluster, $l\in\mathcal{C}$. As a result, $\mathcal{I}_{l}^{cl}=\mathcal{I}_{i}^{L}$ if $i\in\mathcal{V}_{l}$. Moreover, we denote by $n_{l}\triangleq|\mathcal{I}_{l}^{cl}|$ the number of agents involved in the LVF of cluster $l$. Note that the LVFs of different clusters may involve overlapping agents, that is, $\mathcal{I}_{l_{1}}^{cl}\cap\mathcal{I}_{l_{2}}^{cl}\neq\varnothing$ may hold for distinct clusters $l_{1}$, $l_{2}$.

Suppose that the communication graph $\mathcal{G}_{C}=(\mathcal{V},\mathcal{E}_{C})$ is available. To make each agent obtain all the individual rewards involved in its LVF, we design a local consensus algorithm with which the agents involved in each LVF cooperatively estimate the average of their rewards by achieving average consensus. Define $n$ communication weight matrices $C^{l}\in\mathbb{R}^{N\times N}$ as follows:

$C_{ij}^{l}\left\{\begin{aligned}&>0,~~\text{if}~i,j\in\mathcal{I}_{l}^{cl},(i,j)\in\mathcal{E}_{C};\\&=0,~~\text{otherwise},\end{aligned}\right.~~~~l\in\mathcal{C},$ (15)

where $\mathcal{C}=\{1,...,n\}$ is the set of cluster indices.

We assume that, given an initial state, by implementing the global joint policy $\pi(\theta)=(\pi_{1}(\theta_{1}),...,\pi_{N}(\theta_{N}))$, each agent $i$ is able to obtain the reward $r_{i}(\theta,\xi_{i},t)$ at each time step $t=0,...,T_{e}-1$, where $T_{e}$ is the number of evolution steps for policy evaluation, and $\xi_{i}$ accounts for the random effects of both the initial states and the state transitions of the agents involved in agent $i$'s reward, with $\mathbb{E}[\xi_{i}]=0$ and $\mathbb{E}[\xi_{i}^{2}]=\sigma_{i}^{2}$ bounded, $i\in\mathcal{V}$. We then write the observed individual value of agent $i$ as $W_{i}(\theta,\xi_{i})\triangleq\sum_{t=0}^{T_{e}-1}\gamma^{t}r_{i}(\theta,\xi_{i},t)=\mathbb{E}[W_{i}(\theta,\xi_{i})]+\xi_{i}$. The quantity $\xi_{i}$ can follow any distribution as long as it has zero mean and bounded variance. The zero-mean assumption ensures that agent $i$ is able to evaluate its individual value $W_{i}(\theta,\xi_{i})$, and thereby estimate the gradient, accurately if sufficiently many noisy observations are collected. The boundedness of $\sigma_{i}$ guarantees the boundedness of each noisy observation. Similar assumptions have been made in other RL references, e.g., [39].

We further define $\hat{W}_{i}(\theta,\xi)=\sum_{j\in\mathcal{I}_{i}^{L}}W_{j}(\theta,\xi_{j})$ as the observed LVF value of agent $i$, and $W(\theta,\xi)=\sum_{i\in\mathcal{V}}W_{i}(\theta,\xi_{i})$ as the observed GVF value.

The distributed RL algorithm is shown in Algorithm 1 (note that the transition probability of the MAS is never used in Algorithm 1, which is consistent with most model-free RL algorithms in the literature). The consensus iteration (16) makes each agent $i$ in cluster $\mathcal{V}_{l}$ estimate $\frac{1}{n_{l}}\hat{W}_{i}(\theta^{k}+\delta u^{k},\xi^{k})=\frac{1}{n_{l}}\sum_{j\in\mathcal{I}_{i}^{L}}W_{j}(\theta^{k}+\delta u^{k},\xi_{j}^{k})$, which is the average of the reward sums among the agents involved in the corresponding LVF.

Algorithm 1 Distributed RL Algorithm

Input: Step size $\eta$, initial state distribution $\mathcal{D}$, number of learning epochs $K$, number of evolution steps $T_{e}$ (for policy evaluation), number of consensus iterations $T_{c}$, initial policy parameter $\theta^{0}$, smoothing radius $\delta>0$.
Output: $\theta^{K}$.

1. for $k=0,1,...,K-1$ do
2.   Sample $s_{0}^{k}\sim\mathcal{D}$.
3.   for all $i\in\mathcal{V}$ do (simultaneous implementation)
4.     Agent $i$ samples $u_{i}^{k}\sim\mathcal{N}(0,I_{d_{i}})$, implements policy $\pi_{i}(\theta_{i}^{k}+\delta u_{i}^{k})$ for $t=0,...,T_{e}-1$, and observes $W_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})$. For each $l\in\mathcal{C}$, it sets $\mu_{i}^{kl}(0)\leftarrow W_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})$ if $i\in\mathcal{I}_{l}^{cl}$, and $\mu_{i}^{kl}(0)\leftarrow 0$ otherwise.
5.     for $v=0,...,T_{c}-1$ do
6.       Agent $i$ sends $\mu_{i}^{kl}(v)$, $l\in\mathcal{C}$, to its neighbors in $\mathcal{G}_{C}$, and computes $\mu_{i}^{kl}(v+1)$ according to the updating law

         $\mu_{i}^{kl}(v+1)=\sum_{j\in\mathcal{I}_{i}^{C}}C_{ij}^{l}\mu_{j}^{kl}(v),$ (16)

         where $\mathcal{I}_{i}^{C}=\mathcal{N}_{i}^{C}\cup\{i\}$ and $\mathcal{N}_{i}^{C}$ denotes the neighbor set of agent $i$ in the communication graph $\mathcal{G}_{C}$.
7.     end for
8.     Agent $i$ estimates its local gradient

         $g_{i}(\theta^{k},u^{k},\xi^{k})=\frac{n_{l_{i}}\mu_{i}^{kl_{i}}(T_{c})}{\delta}u_{i}^{k},$ (17)

         where $l_{i}$ denotes the cluster containing $i$. Then agent $i$ updates its policy according to

         $\theta_{i}^{k+1}=\theta_{i}^{k}+\eta g_{i}(\theta^{k},u^{k},\xi^{k}).$ (18)
9.   end for
10. end for
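For reference, the sketch below condenses one epoch of Algorithm 1 (steps 2-8) into plain Python. The rollout routine, the cluster bookkeeping, and the weight matrices $C^{l}$ are hypothetical placeholders; in an actual deployment each agent runs its own loop and exchanges $\mu$ values only with its communication neighbors, whereas here the consensus rounds are written compactly as matrix-vector products.

```python
import numpy as np

def one_epoch(theta, rollout, clusters, cluster_of, C, delta, eta, Tc, rng):
    """One learning epoch of Algorithm 1 (sketch).

    theta      : dict agent -> parameter vector theta_i
    rollout    : function(dict of perturbed parameters) -> dict agent -> W_i; assumed to
                 sample s_0 ~ D and run the joint policy for T_e steps internally
    clusters   : dict l -> set of agents I_l^cl involved in cluster l's LVF
    cluster_of : dict agent -> the cluster l_i containing it
    C          : dict l -> N x N weight matrix C^l as in (15), indexed in the order of list(theta)
    """
    agents = list(theta)
    # Step 4: perturb each local policy, roll out once, and initialize mu_i^{kl}(0).
    u = {i: rng.standard_normal(theta[i].shape) for i in agents}
    W = rollout({i: theta[i] + delta * u[i] for i in agents})
    mu = {l: np.array([W[i] if i in members else 0.0 for i in agents])
          for l, members in clusters.items()}
    # Steps 5-7: Tc rounds of the local consensus update (16), as matrix-vector products.
    for l in clusters:
        for _ in range(Tc):
            mu[l] = C[l] @ mu[l]
    # Step 8: zeroth-order gradient estimate (17) and policy update (18).
    new_theta = {}
    for idx, i in enumerate(agents):
        l = cluster_of[i]
        g_i = (len(clusters[l]) * mu[l][idx] / delta) * u[i]   # n_l * mu_i(T_c) / delta * u_i
        new_theta[i] = theta[i] + eta * g_i
    return new_theta
```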

To ensure that Algorithm 1 works efficiently, we make the following assumption on the graph $\mathcal{G}_{C}$.

Assumption 3

The communication graph $\mathcal{G}_{C}$ is undirected, and the agents specified by $\mathcal{I}_{l}^{cl}$ form a connected component of $\mathcal{G}_{C}$ for all $l\in\mathcal{C}$.

The following lemma gives a sufficient condition for Assumption 3.

Lemma 4

Given that the graph $\mathcal{G}_{C}$ is undirected, Assumption 3 holds if $\mathcal{E}_{SO}\subseteq\mathcal{E}_{C}$.

Proof: For each cluster $l\in\mathcal{C}$, there exists a path in $\mathcal{G}_{SO}$ from cluster $l$ to every agent in $\mathcal{I}_{l}^{cl}$. Since $\mathcal{E}_{SO}\subseteq\mathcal{E}_{C}$ and $\mathcal{G}_{C}$ is undirected, the agents in $\mathcal{I}_{l}^{cl}$ must be connected in $\mathcal{G}_{C}$. $\blacksquare$

Once a communication graph $\mathcal{G}_{C}$ satisfying Assumption 3 is available, we design the communication weights such that the following assumption holds.

Assumption 4

$C^{l}$ is doubly stochastic, i.e., $C^{l}\mathbf{1}_{N}=\mathbf{1}_{N}$ and $\mathbf{1}_{N}^{\top}C^{l}=\mathbf{1}_{N}^{\top}$, for all $l\in\mathcal{C}$.

Assumption 4 guarantees that average consensus can be achieved among the agents involved in each LVF. Since one agent may be involved in the LVFs of multiple clusters, it may keep multiple different nonzero communication weights for the same communication link. From the definition of $C^{l}$ in (15), $C_{jj^{\prime}}^{l}=0$ for all $j\notin\mathcal{I}_{i}^{L}$ and $j^{\prime}\in\mathcal{V}$. Then $\mu_{j}^{kl}(v)=0$ for $j\notin\mathcal{I}_{i}^{L}$ and any $v\geq 0$. Moreover, let $C^{l}_{0}\in\mathbb{R}^{n_{l}\times n_{l}}$ be the principal submatrix of $C^{l}$ obtained by removing the $j$-th row and column for all $j\notin\mathcal{I}_{i}^{L}$; then Assumption 4 implies that $C^{l}_{0}$ is doubly stochastic for all $l\in\mathcal{C}$. Define $\rho_{l}=\|C^{l}_{0}-\frac{1}{n_{l}}\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top}\|$; it has been shown in [40] that under Assumption 4, $\rho_{l}\in(0,1)$.
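One standard way to satisfy Assumption 4 on an undirected communication graph is to assign Metropolis-Hastings weights on the subgraph induced by $\mathcal{I}_{l}^{cl}$; the resulting matrix is symmetric and row-stochastic, hence doubly stochastic. The construction below is our illustration of such a choice, not one prescribed by the paper.

```python
import numpy as np

def metropolis_weights(members, undirected_edges):
    """Doubly stochastic weight matrix over the agents in one cluster's set I_l^cl.

    members          : ordered list of the agents in I_l^cl
    undirected_edges : communication edges of G_C restricted to those agents
    """
    idx = {v: k for k, v in enumerate(members)}
    n = len(members)
    deg = {v: 0 for v in members}
    for (i, j) in undirected_edges:
        deg[i] += 1
        deg[j] += 1
    C = np.zeros((n, n))
    for (i, j) in undirected_edges:
        w = 1.0 / (1 + max(deg[i], deg[j]))   # Metropolis-Hastings weight
        C[idx[i], idx[j]] = C[idx[j], idx[i]] = w
    np.fill_diagonal(C, 1.0 - C.sum(axis=1))  # self-weights make each row sum to one
    return C                                  # symmetric + row-stochastic => doubly stochastic

C0 = metropolis_weights([1, 2, 3], [(1, 2), (2, 3)])
print(C0.sum(axis=0), C0.sum(axis=1))         # both are all-ones vectors
```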

Remark 2

When the graph $\mathcal{G}_{SO}$ is strongly connected, all the agents form one cluster and achieve average consensus during the learning process; Algorithm 1 then reduces to a global consensus-based distributed RL algorithm. In fact, under any graph $\mathcal{G}_{SO}$, the global consensus-based framework can always solve the distributed RL problem. However, when Assumption 1 holds, Algorithm 1 only requires consensus to be achieved among smaller groups, and therefore exhibits a faster convergence rate. When the multi-agent network is of large scale, the number of agents involved in each LVF may be significantly smaller than the total number of agents in the whole network. In such scenarios, Algorithm 1 converges much faster than the global consensus-based algorithm for two reasons: (i) the average consensus tasks are performed within smaller groups; (ii) the gradient estimation based on the LVF $\hat{J}_{i}(\theta)$ has a lower variance than that based on the GVF $J(\theta)$; see Remark 3 for more details.

IV-B Convergence Analysis

In this subsection, convergence analysis of Algorithm 1 will be presented. The following assumption is made to guarantee the solvability of the problem (2).

Assumption 5

The individual reward of each agent at any time $t$ is uniformly bounded, i.e., $r_{l}\leq r_{i}(t)\leq r_{u}$ for all $i\in\mathcal{V}$ and $t\in\mathbb{N}$.

Lemma 5

Under Assumption 5, there exist $J_{l}$ and $J_{u}$ such that $J_{l}\leq J_{i}(\theta)\leq J_{u}$ for any $\theta\in\mathbb{R}^{d}$, $i\in\mathcal{V}$.

Lemma 5 implies that there exists an optimal policy for the RL problem (4), which is the premise of solving problem (4). Based on Lemma 5, we can bound $\hat{J}_{i}$ and $J=\sum_{i\in\mathcal{V}}J_{i}$ by $[\hat{J}_{l},\hat{J}_{u}]$ and $[J_{l},J_{u}]$, respectively. The following lemma bounds the error between the actual LVF and the expectation of the observed LVF.

Lemma 6

Under Assumption 5, the following holds for all $l\in\mathcal{C}$ and $i\in\mathcal{V}_{l}$:

$|\hat{J}_{i}(\theta)-\mathbb{E}[\hat{W}_{i}(\theta,\xi)]|\leq n_{l}\gamma^{T_{e}}J_{0},$ (19)

where $J_{0}=\max\{|J_{l}|,|J_{u}|\}=\frac{r_{0}}{1-\gamma}$, $r_{0}=\max\{|r_{l}|,|r_{u}|\}$, and $n_{l}=|\mathcal{I}_{i}^{L}|=|\mathcal{I}_{l}^{cl}|$ is the number of agents involved in $\hat{J}_{i}(\theta)$.

Let $\mu^{kl}=(...,\mu_{j}^{kl},...)_{j\in\mathcal{I}_{i}^{L}}^{\top}\in\mathbb{R}^{|\mathcal{I}_{i}^{L}|}$. The following lemma bounds the LVF estimation error.

Lemma 7

Under Assumptions 1 and 3-5, by implementing Algorithm 1, the following inequalities hold for any $l\in\mathcal{C}$ and $i\in\mathcal{V}_{l}$:

$|\mathbb{E}[n_{l}\mu_{i}^{kl}(T_{c})]-\hat{J}_{i}(\theta^{k}+\delta u^{k})|\leq E_{i},$ (20)

$\mathbb{E}_{\xi^{k}\sim\mathcal{H}}\left[\left[n_{l}\mu_{i}^{kl}(T_{c})\right]^{2}\right]\leq B^{\mu}_{l},$ (21)

where $E_{i}=\rho_{l}^{T_{c}}n_{l}^{2}(J_{u}-J_{l}+\gamma^{T_{e}}J_{0})+n_{l}^{2}\gamma^{T_{e}}J_{0}$, $B^{\mu}_{l}=n_{l}^{2}\left(\sigma_{0}^{2}+(1+\gamma^{T_{e}})^{2}J_{0}^{2}\right)$, and $\sigma_{0}=\max_{i\in\mathcal{V}}\sigma_{i}$.

The following lemma bounds the variance of the zeroth-order oracle (17).

Lemma 8

Under Assumptions 1 and 3-5, for any $i\in\mathcal{V}_{l}$, it holds that $\mathbb{E}[\|g_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}]\leq\frac{B^{\mu}_{l}d_{i}}{\delta^{2}}$.

Remark 3

(Low Gradient Estimation Variance Induced by LVFs) Lemma 8 shows that the variance of each local zeroth-order oracle is mainly associated with $n_{l}$ in $B_{l}^{\mu}$, which is the number of agents involved in the LVF of the $l$-th cluster. If the policy evaluation were based on the global reward, the bound on $\mathbb{E}[\|g_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}]$ would be $\frac{N^{2}B^{\mu}_{l}d_{i}}{n_{l}^{2}\delta^{2}}$. When the network is of large scale, $N$ may be significantly larger than $n_{l}$, so the variance of the zeroth-order oracle is then much higher than in our case. Therefore, our algorithm has significantly improved scalability to large-scale networks.

Theorem 1

Under Assumptions 1-5, let $\delta=\frac{\epsilon}{L\sqrt{d}}$ and $\eta=\frac{\epsilon^{1.5}}{d^{1.5}\sqrt{K}}$. The following statements hold:

(i). $|J^{\delta}(\theta)-J(\theta)|\leq\epsilon$ for any $\theta\in\mathbb{R}^{d}$.

(ii). By implementing Algorithm 1, if

$K\geq\frac{d^{3}B^{2}}{\epsilon^{5}},~~~~T_{e}\geq\log_{\gamma}\frac{\epsilon^{1.5}}{2\sqrt{2}n_{0}^{2}LdJ_{0}},~~~~T_{c}\geq\log_{\rho_{0}}\frac{\epsilon^{1.5}}{2\sqrt{2}n_{0}^{2}Ld(J_{u}-J_{l}+J_{0})},$ (22)

then

$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla_{\theta}J^{\delta}(\theta^{k})\|^{2}]\leq\epsilon,$ (23)

where $B=2(NJ_{u}-J^{\delta}(\theta^{0}))+L^{4}n_{0}^{2}(\sigma_{0}^{2}+(1+\gamma^{T_{e}})^{2}J_{0}^{2})$ and $\rho_{0}=\max_{l\in\mathcal{C}}\rho_{l}$.
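As a concrete reading of (22), the snippet below evaluates the three lower bounds for illustrative values of the problem constants; the numbers are assumptions chosen only to exercise the formulas, and $B$ is taken as already evaluated (it depends on $\theta^{0}$ and $T_{e}$).

```python
import math

# Illustrative constants (assumptions for demonstration only).
eps, d, L = 0.1, 20, 1.0
n0, J0, Ju, Jl = 4, 10.0, 0.0, -10.0
gamma, rho0 = 0.95, 0.8
B = 50.0   # B as defined in Theorem 1, assumed to have been evaluated already

K_min = d**3 * B**2 / eps**5
Te_min = math.log(eps**1.5 / (2 * math.sqrt(2) * n0**2 * L * d * J0), gamma)
Tc_min = math.log(eps**1.5 / (2 * math.sqrt(2) * n0**2 * L * d * (Ju - Jl + J0)), rho0)

print(f"K >= {K_min:.3g}, Te >= {Te_min:.1f}, Tc >= {Tc_min:.1f}")
```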

Remark 4

(Optimality Analysis) Theorem 1 (ii) implies convergence to a stationary point of $J^{\delta}(\theta)$, the smoothed value function. (The reason why we do not analyze stationary points of $J(\theta)$ is that we do not assume $J(\theta)$ to be differentiable. Since $J^{\delta}(\theta)$ is close to $J(\theta)$, as shown in Theorem 1 (i), an optimal policy for $J^{\delta}(\theta)$ is a near-optimal policy for $J(\theta)$. If we further assume that $J(\theta)$ has a Lipschitz continuous gradient, then the error of convergence to a stationary point of $J(\theta)$ can be obtained by quantifying the error between $\nabla_{\theta}J^{\delta}(\theta)$ and $\nabla_{\theta}J(\theta)$.) When the MARL problem satisfies gradient domination and the policy parameterization is complete, a stationary point is always a global optimum [41, 42]. Note that our formulation is general and contains cases that do not satisfy gradient domination. For example, as a special case of our formulation, the linear optimal distributed control problem has many undesired stationary points [43].

Remark 5

(Sample Complexity Analysis) According to Theorem 1, the sample complexity of Algorithm 1 is \mathcal{O}(\frac{n_{0}^{4}J_{0}^{4}}{\epsilon^{5}}), which is worse than that of other ZOO-based algorithms in [44, Table 1]; this gap is mainly caused by the use of one-point zeroth-order oracles and the mild assumptions (non-smoothness and nonconvexity) on the value function. In Section VI, we analyze the advantage of using two-point zeroth-order oracles. Note that the lower bounds of K, T_{e} and T_{c} are all positively associated with n_{0}, the maximal number of agents involved in one LVF. According to the definition of \mathcal{I}_{i}^{L} in (8), n_{l} is determined by the length of the paths starting from cluster l in graph \mathcal{G}_{SO}. This implies that the convergence rate depends on the maximal length of a path in \mathcal{G}_{SO}: shorter paths improve sample efficiency and speed up convergence.

When Assumption 1 does not hold, Lemmas 7 and 8 and Theorem 1 remain valid. However, the LVF-based method reduces to a GVF-based method and no longer offers any advantage: there is only one cluster, each path attains its maximum length, the distributed RL algorithm becomes centralized, and the sample complexity reaches its maximum.

V Distributed RL via Truncated Local Value Functions

Even if Assumption 1 holds, the sample complexity of Algorithm 1 may still be high due to large SCCs or long paths in \mathcal{G}_{SO}. In this section, motivated by [8], we resolve this issue by further dividing large SCCs into smaller SCCs (clusters) and, when designing the LVF for each cluster, ignoring agents that are far away from it in the graph. The resulting approximation error for each cluster turns out to depend on the distance between the ignored agents and this cluster. Different from [8], where each agent neglects the effects of other agents that are far away, our design aims to make each cluster neglect its effects on agents that are far away from it. Moreover, in our setting, the agents in each cluster estimate their common LVF value via local consensus, whereas in [8] each agent has its own LVF and how this value can be obtained was not specified.

V-A Truncated Local Value Function Design

Different from the aforementioned SCC-based clustering for \mathcal{G}_{SO}, we now artificially specify a clustering \mathcal{V}=\cup_{l=1}^{n}\mathcal{V}_{l}, where \mathcal{V}_{l_{1}}\cap\mathcal{V}_{l_{2}}=\varnothing for distinct l_{1},l_{2}\leq n and each \mathcal{V}_{l} still corresponds to an SCC in \mathcal{G}_{SO}. However, each cluster may no longer be a maximal SCC; that is, multiple clusters may together form a larger SCC of \mathcal{G}_{SO}.

Next we define a distance function D(i,j) to describe how many steps are needed for the action of agent i to affect another agent j\in\mathcal{I}_{i}^{L}. According to Lemma 1 (i), when (j,i)\in\mathcal{E}_{L}, there is always a path from i to j in \mathcal{G}_{SOR}. Let P(i,j) be the length of the shortest path from vertex i to vertex j in graph \mathcal{G}_{SOR}, where the length of a path refers to the number of edges it contains. The distance function is defined as

D(i,j)={0,i=j,P(i,j),(j,i)L,,(j,i)L.D(i,j)=\left\{\begin{array}[]{lrlrlr}0,&i=j,\\ P(i,j),&(j,i)\in\mathcal{E}_{L},\\ \infty,&(j,i)\notin\mathcal{E}_{L}.\end{array}\right. (24)

We clarify the following facts regarding D(i,j). (i). There may be a path from i to j in \mathcal{G}_{SOR} even though (j,i)\notin\mathcal{E}_{L}; see Fig. 3 for an example. Therefore, to exclude agents j\notin\mathcal{I}_{i}^{L}, we define D(i,j) as above instead of using P(i,j) directly to characterize the inter-agent distance. (ii). The distance function D(i,j) defined here is unidirectional and does not satisfy the symmetry property of a metric. (iii). Although the artificial SCC clustering is obtained from \mathcal{G}_{SO}, the path length P(i,j) is calculated on graph \mathcal{G}_{SOR} because the latter always contains all the edges from i to any j\in\mathcal{I}_{i}^{L}. If \mathcal{G}_{SO} were used instead, some agent in \mathcal{I}_{j}^{R}, j\in\mathcal{I}_{i}^{L}, might be missed.

We further define the distance from a cluster ll to an agent jj as D(𝒱l,j)=mini𝒱lD(i,j)D(\mathcal{V}_{l},j)=\min_{i\in\mathcal{V}_{l}}D(i,j). Denote by Dl=maxj𝒱D(𝒱l,j)D_{l}^{*}=\max_{j\in\mathcal{V}}D(\mathcal{V}_{l},j) the maximum distance from cluster ll to any agent out of this cluster that can be affected by cluster ll. Since we have defined nl=|lcl|n_{l}=|\mathcal{I}_{l}^{cl}|, it is observed that Dl=nl|𝒱l|D_{l}^{*}=n_{l}-|\mathcal{V}_{l}|.

Given a cluster l𝒞l\in\mathcal{C}, for any agent i𝒱li\in\mathcal{V}_{l}, we define the following TLVF:

J~iδ(θ)={jiLJjδ(θ),κDl,j𝒱lκJjδ(θ),κ<Dl,\tilde{J}_{i}^{\delta}(\theta)=\left\{\begin{array}[]{lrlr}\sum_{j\in\mathcal{I}_{i}^{L}}J_{j}^{\delta}(\theta),&\kappa\geq D_{l}^{*},\\ \sum_{j\in\mathcal{V}_{l}^{\kappa}}J_{j}^{\delta}(\theta),&\kappa<D_{l}^{*},\end{array}\right. (25)

where 𝒱lκ={j𝒱:D(𝒱l,j)κ}\mathcal{V}_{l}^{\kappa}=\{j\in\mathcal{V}:D(\mathcal{V}_{l},j)\leq\kappa\} is the set of agents involved in the TLVF of cluster ll, κ+\kappa\in\mathbb{N}_{+} is a pre-specified truncation index describing the maximum distance from each cluster ll within which the agents are taken into account in the TLVF of cluster ll. Similarly, we define J~i(θ)=jiLJj(θ)\tilde{J}_{i}(\theta)=\sum_{j\in\mathcal{I}_{i}^{L}}J_{j}(\theta) if κDl\kappa\geq D_{l}^{*}, and J~i(θ)=j𝒱lκJj(θ)\tilde{J}_{i}(\theta)=\sum_{j\in\mathcal{V}_{l}^{\kappa}}J_{j}(\theta) otherwise, W~i(θ)=jiLWj(θ)\tilde{W}_{i}(\theta)=\sum_{j\in\mathcal{I}_{i}^{L}}W_{j}(\theta) if κDl\kappa\geq D_{l}^{*}, and W~i(θ)=j𝒱lκWj(θ)\tilde{W}_{i}(\theta)=\sum_{j\in\mathcal{V}_{l}^{\kappa}}W_{j}(\theta) otherwise.
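To illustrate the constructions above, the following sketch (ours, with a hypothetical toy graph; it is not code from the paper) computes the distance D(i,j) in (24) by breadth-first search on \mathcal{G}_{SOR}, the cluster distance D(\mathcal{V}_{l},j), and the truncated agent set \mathcal{V}_{l}^{\kappa} used by the TLVF in (25).

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """BFS path lengths P(source, j) in a directed graph given as {i: [j, ...]}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def distance_D(adj_SOR, E_L, i, j):
    """D(i, j) as in (24): 0 if i == j, P(i, j) if (j, i) in E_L, infinity otherwise."""
    if i == j:
        return 0
    if (j, i) not in E_L:
        return float("inf")
    return shortest_path_lengths(adj_SOR, i).get(j, float("inf"))

def truncated_set(adj_SOR, E_L, cluster, all_agents, kappa):
    """V_l^kappa = {j : D(V_l, j) <= kappa}, where D(V_l, j) = min_{i in V_l} D(i, j)."""
    return {
        j for j in all_agents
        if min(distance_D(adj_SOR, E_L, i, j) for i in cluster) <= kappa
    }

# Hypothetical example: a directed line 1 -> 2 -> 3 -> 4 in G_SOR, and E_L
# containing (j, i) whenever j is reachable from i.
adj_SOR = {1: [2], 2: [3], 3: [4], 4: []}
E_L = {(2, 1), (3, 1), (4, 1), (3, 2), (4, 2), (4, 3)}
print(truncated_set(adj_SOR, E_L, cluster={1}, all_agents={1, 2, 3, 4}, kappa=1))
# -> {1, 2}: only agents within distance 1 of cluster {1} enter its TLVF.
```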

The following lemma bounds the error between the local gradients of the TLVF and the GVF.

Lemma 9

Under Assumption 2, given cluster ll and agent i𝒱li\in\mathcal{V}_{l}, the following bound holds for J~iδ(θ)\tilde{J}^{\delta}_{i}(\theta):

θiJ~iδ(θ)θiJδ(θ)γκ+1j𝒱¯lκLjddi,\|\nabla_{\theta_{i}}\tilde{J}_{i}^{\delta}(\theta)-\nabla_{\theta_{i}}J^{\delta}(\theta)\|\leq\gamma^{\kappa+1}\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}L_{j}\sqrt{dd_{i}}, (26)

where 𝒱¯lκ={j𝒱:κ<D(𝒱l,j)<}=𝒱l𝒱lκ\bar{\mathcal{V}}_{l}^{\kappa}=\{j\in\mathcal{V}:\kappa<D(\mathcal{V}_{l},j)<\infty\}=\mathcal{V}_{l}\setminus\mathcal{V}_{l}^{\kappa}.

Lemma 9 implies that the error between \nabla_{\theta_{i}}\tilde{J}_{i}^{\delta}(\theta) and \nabla_{\theta_{i}}J^{\delta}(\theta) decays exponentially with the exponent \kappa. Therefore, when \gamma^{\kappa+1} is sufficiently small, employing \nabla_{\theta_{i}}\tilde{J}_{i}^{\delta}(\theta) in the gradient ascent algorithm induces an acceptable error. This is the fundamental idea of our approach. In the next subsection, we present the detailed algorithm design and convergence analysis.

V-B Distributed RL with Convergence Analysis

Next we design a distributed RL algorithm based on the TLVFs. It suffices to redesign the communication weights so that the value of \tilde{J}_{i}^{\delta}(\theta) (instead of \hat{J}^{\delta}_{i}(\theta)) can be estimated by each agent i\in\mathcal{V}. For any cluster l\in\mathcal{C}, instead of using \mathcal{I}_{l}^{cl}, we set the index set of agents involved in the TLVF as \mathcal{I}_{l}^{\kappa}=\mathcal{I}_{l}^{cl}\cap\mathcal{V}_{l}^{\kappa}.

Let n_{l}^{\kappa} denote the number of agents involved in each TLVF. Then n_{l}^{\kappa}\leq n_{l}, with equality if \kappa\geq D_{l}^{*}.

The ll-th communication weight matrix is then redesigned as

Cijl,κ{>0,ifi,jlκ,(i,j)C;=0,otherwise,l𝒞,C_{ij}^{l,\kappa}\left\{\begin{aligned} &>0,~{}~{}\text{if}~{}i,j\in\mathcal{I}_{l}^{\kappa},(i,j)\in\mathcal{E}_{C};\\ &=0,~{}~{}\text{otherwise},\end{aligned}\right.~{}~{}~{}~{}l\in\mathcal{C}, (27)

where 𝒞={1,,n}\mathcal{C}=\{1,...,n\}.

Similar to Assumptions 3 and 4, we make the following two assumptions.

Assumption 6

The communication graph 𝒢C\mathcal{G}_{C} is undirected, and the agents specified by lκ\mathcal{I}_{l}^{\kappa} form a connected component of 𝒢C\mathcal{G}_{C} for all l𝒞l\in\mathcal{C}.

Assumption 7

Cl,κC^{l,\kappa} is doubly stochastic for all l𝒞l\in\mathcal{C}.

Note that Assumption 6 is milder than Assumption 3 because lκlcl\mathcal{I}_{l}^{\kappa}\subseteq\mathcal{I}_{l}^{cl}, implying that fewer communication links are needed when the TLVF method is employed. Moreover, when the communication graph 𝒢C\mathcal{G}_{C} is available, κ\kappa can be designed to meet Assumption 7 .

The distributed RL algorithm based on TLVFs can be obtained by simply replacing the communication weight matrices ClC^{l} with Cl,κC^{l,\kappa}, for all l𝒞l\in\mathcal{C}.

Similar to Lemmas 6, 7 and 8, we have the following results.

Lemma 10

Under Assumption 5, the following holds for all l𝒞l\in\mathcal{C} and i𝒱li\in\mathcal{V}_{l}:

|J~i(θ)𝔼[W~i(θ,ξ)]|nlκγTeJ0.|\tilde{J}_{i}(\theta)-\mathbb{E}[\tilde{W}_{i}(\theta,\xi)]|\leq n_{l}^{\kappa}\gamma^{T_{e}}J_{0}. (28)
Lemma 11

Under Assumptions 1, 5-7, by implementing Algorithm 1, the following inequalities hold for any l𝒞l\in\mathcal{C} and i𝒱li\in\mathcal{V}_{l}:

|𝔼[nlμikl(Tc)]J~i(θk+δuk)|Eiκ,|\mathbb{E}[n_{l}\mu_{i}^{kl}(T_{c})]-\tilde{J}_{i}(\theta^{k}+\delta u^{k})|\leq E_{i}^{\kappa}, (29)
𝔼ξk[[nlμikl(Tc)]2]Blκ,\mathbb{E}_{\xi^{k}\sim\mathcal{H}}\left[\left[n_{l}\mu_{i}^{kl}(T_{c})\right]^{2}\right]\leq B^{\kappa}_{l}, (30)

where Eiκ=(ρlκ)Tc(nlκ)2(JuJl+γTeJ0)+(nlκ)2γTeJ0E_{i}^{\kappa}=(\rho_{l}^{\kappa})^{T_{c}}(n_{l}^{\kappa})^{2}\left(J_{u}-J_{l}+\gamma^{T_{e}}J_{0}\right)+(n_{l}^{\kappa})^{2}\gamma^{T_{e}}J_{0}, Blκ=(nlκ)2(σ02+(1+γTe)2J02)B^{\kappa}_{l}=(n_{l}^{\kappa})^{2}(\sigma_{0}^{2}+(1+\gamma^{T_{e}})^{2}J_{0}^{2}), σ0=maxi𝒱σi\sigma_{0}=\max_{i\in\mathcal{V}}\sigma_{i}, ρlκ=C0l,κ1nlκ𝟏nlκ𝟏nlκ\rho_{l}^{\kappa}=\|C^{l,\kappa}_{0}-\frac{1}{n_{l}^{\kappa}}\mathbf{1}_{n_{l}^{\kappa}}\mathbf{1}_{n_{l}^{\kappa}}^{\top}\|.

Lemma 12

Under Assumptions 1, 5-7, for any i𝒱li\in\mathcal{V}_{l}, it holds that 𝔼[gi(θk,uk,ξk)2]Blκdiδ2\mathbb{E}[\|g_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}]\leq\frac{B^{\kappa}_{l}d_{i}}{\delta^{2}}.

Theorem 2

Under Assumptions 1, 2, 5-7, let δ=ϵLd\delta=\frac{\epsilon}{L\sqrt{d}}, η=ϵ1.5d1.5K\eta=\frac{\epsilon^{1.5}}{d^{1.5}\sqrt{K}}. The following statements hold:

(i). |Jδ(θ)J(θ)|ϵ|J^{\delta}(\theta)-J(\theta)|\leq\epsilon for any θd\theta\in\mathbb{R}^{d}.

(ii). By implementing Algorithm 1, if

Kd3B2ϵ5,Telogγϵ1.54(n0κ)2LdJ0,Tclogρ0ϵ1.54(n0κ)2Ld(JuJl+J0),\begin{split}&K\geq\frac{d^{3}B^{2}}{\epsilon^{5}},~{}~{}~{}~{}T_{e}\geq\log_{\gamma}\frac{\epsilon^{1.5}}{4(n_{0}^{\kappa})^{2}LdJ_{0}},\\ &T_{c}\geq\log_{\rho_{0}}\frac{\epsilon^{1.5}}{4(n_{0}^{\kappa})^{2}Ld(J_{u}-J_{l}+J_{0})},\end{split} (31)

then

1Kk=0K1𝔼[θJδ(θk)2]ϵ+γκ+1maxl𝒞|𝒱¯lκ|L0dd0,\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla_{\theta}J^{\delta}(\theta^{k})\|^{2}]\leq\epsilon+\gamma^{\kappa+1}\max_{l\in\mathcal{C}}|\bar{\mathcal{V}}_{l}^{\kappa}|L_{0}\sqrt{dd_{0}}, (32)

where B=2(NJuJδ(θ0))+(n0κ)2(σ02+(1+γTe)2J02)L4B=2(NJ_{u}-J^{\delta}(\theta^{0}))+(n_{0}^{\kappa})^{2}(\sigma_{0}^{2}+(1+\gamma^{T_{e}})^{2}J_{0}^{2})L^{4}, n0κ=maxl𝒞nlκn_{0}^{\kappa}=\max_{l\in\mathcal{C}}n_{l}^{\kappa}.

Remark 6

(Sample Complexity Analysis) The sample complexity provided in Theorem 2 is associated with n_{0}^{\kappa}, which may be significantly smaller than n_{0} in Theorem 1, depending on the choice of \kappa. On the other hand, the convergence error in Theorem 2 has an extra term associated with \gamma^{\kappa+1}. Therefore, there is a trade-off in choosing \kappa: the greater \kappa is, the smaller the convergence error will be; however, the convergence rate may decrease. For example, when \gamma=0.6, we have \gamma^{\kappa+1}\leq 0.028 if \kappa\geq 6. In this case, we can choose \kappa=6, implying that each cluster only considers its effects on 6 agents other than this cluster even when \mathcal{G}_{SO} is strongly connected. Therefore, when the network is of a very large scale with long paths in graph \mathcal{G}_{SO}, using the TLVFs can further reduce the sample complexity.
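As a small sketch of this trade-off (ours; the tolerance value is only illustrative), one can pick the smallest truncation index \kappa such that \gamma^{\kappa+1} falls below a prescribed tolerance:

```python
import math

def smallest_kappa(gamma: float, tol: float) -> int:
    """Smallest kappa with gamma**(kappa + 1) <= tol, clamped at 0."""
    return max(0, math.ceil(math.log(tol) / math.log(gamma)) - 1)

print(smallest_kappa(0.6, 0.028))  # -> 6, matching the example in Remark 6
```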

VI Variance Reduction by Two-Point Zeroth-Order Oracles

The two distributed RL algorithms proposed in the last two sections are based on the one-point zeroth-order oracle (17). We observe that Algorithm 1 is always efficient as long as g(\theta^{k},u^{k},\xi^{k}) is an unbiased estimate of \nabla_{\theta}J^{\delta}(\theta^{k}) and \mathbb{E}[\|g(\theta^{k},u^{k},\xi^{k})\|^{2}] is bounded. Therefore, the two-point feedback oracles proposed in [36] and the residual feedback oracle in [44] can also be employed in Algorithm 1. In this section, we briefly analyze how the two-point zeroth-order oracle further reduces the gradient estimation variance.

Based on the LVF design in our work, the two-point feedback oracle for each agent ii at learning episode kk can be obtained as

g¯i(θk,uk,ξk)=μikli(Tc)νikli(Tc)δnliuik,\bar{g}_{i}(\theta^{k},u^{k},\xi^{k})=\frac{\mu_{i}^{kl_{i}}(T_{c})-\nu_{i}^{kl_{i}}(T_{c})}{\delta}n_{l_{i}}u^{k}_{i}, (33)

where νikli(Tc)\nu_{i}^{kl_{i}}(T_{c}) is the approximate estimation of W^i(θk,ζk)/nli\hat{W}_{i}(\theta^{k},\zeta^{k})/n_{l_{i}} via local consensus.

Define \hat{L}_{i}=\sum_{j\in\mathcal{I}_{i}^{L}}L_{j}, which is a Lipschitz constant of \hat{W}_{i}. Then (34) can be derived as follows,

𝔼[g¯i(θk,uk,ξk)2]=𝔼[J^i(θk+δuk)J^i(θk)+nliμikli(Tc)J^i(θk+δuk)+J^i(θk)nliνikli(Tc)δuik2]=𝔼[J^i(θk+δuk)J^i(θk)+𝔼[nliμikli(Tc)]J^i(θk+δuk)+J^i(θk)𝔼[nliνikli(Tc)]+jiL(ξjkζjk)δuik2]2𝔼[L^i2uk2uik2]+2((2Ei)2+2nlσ02)𝔼[uik2/δ2]2L^i2(𝔼[ui4]+𝔼[j𝒱{i}uj2]𝔼[ui2])+4(2Ei2+nlσ02)di/δ22L^i2((di+4)2+(ddi)di)+4(2Ei2+nlσ02)di/δ2=2L^i2(did+8di+16)+4(2Ei2+nlσ02)di/δ2,\begin{split}&\mathbb{E}[\|\bar{g}_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}]=\mathbb{E}[\|\frac{\hat{J}_{i}(\theta^{k}+\delta u^{k})-\hat{J}_{i}(\theta^{k})+n_{l_{i}}\mu_{i}^{kl_{i}}(T_{c})-\hat{J}_{i}(\theta^{k}+\delta u^{k})+\hat{J}_{i}(\theta^{k})-n_{l_{i}}\nu_{i}^{kl_{i}}(T_{c})}{\delta}u_{i}^{k}\|^{2}]\\ &=\mathbb{E}[\|\frac{\hat{J}_{i}(\theta^{k}+\delta u^{k})-\hat{J}_{i}(\theta^{k})+\mathbb{E}[n_{l_{i}}\mu_{i}^{kl_{i}}(T_{c})]-\hat{J}_{i}(\theta^{k}+\delta u^{k})+\hat{J}_{i}(\theta^{k})-\mathbb{E}[n_{l_{i}}\nu_{i}^{kl_{i}}(T_{c})]+\sum_{j\in\mathcal{I}_{i}^{L}}(\xi_{j}^{k}-\zeta_{j}^{k})}{\delta}u_{i}^{k}\|^{2}]\\ &\leq 2\mathbb{E}[\hat{L}_{i}^{2}\|u^{k}\|^{2}\|u_{i}^{k}\|^{2}]+2\left((2E_{i})^{2}+2n_{l}\sigma_{0}^{2}\right)\mathbb{E}[\|u_{i}^{k}\|^{2}/\delta^{2}]\\ &\leq 2\hat{L}_{i}^{2}\left(\mathbb{E}[\|u_{i}\|^{4}]+\mathbb{E}[\sum_{j\in\mathcal{V}\setminus\{i\}}\|u_{j}\|^{2}]\mathbb{E}[\|u_{i}\|^{2}]\right)+4(2E_{i}^{2}+n_{l}\sigma_{0}^{2})d_{i}/\delta^{2}\\ &\leq 2\hat{L}_{i}^{2}\left((d_{i}+4)^{2}+(d-d_{i})d_{i}\right)+4(2E_{i}^{2}+n_{l}\sigma_{0}^{2})d_{i}/\delta^{2}\\ &=2\hat{L}_{i}^{2}(d_{i}d+8d_{i}+16)+4(2E_{i}^{2}+n_{l}\sigma_{0}^{2})d_{i}/\delta^{2},\end{split} (34)

where \xi=(\xi_{1},...,\xi_{N})^{\top} and \zeta=(\zeta_{1},...,\zeta_{N})^{\top} are the noises in the observations W_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k}) and W_{i}(\theta^{k},\zeta^{k}), respectively, i.e., W_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})=\mathbb{E}[W_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})]+\xi_{i} and W_{i}(\theta^{k},\zeta^{k})=\mathbb{E}[W_{i}(\theta^{k},\zeta^{k})]+\zeta_{i}; the first inequality used the bound E_{i} in (20) and the assumptions \mathbb{E}[\xi_{i}]=\mathbb{E}[\zeta_{i}]=0 and \mathbb{E}[\xi_{i}^{2}]\leq\sigma_{0}^{2}, i\in\mathcal{V}.

Comparisons with one-point feedback. Note that E_{i} in (20) can be made arbitrarily small as long as T_{e} and T_{c} are sufficiently large. Let us first consider the ideal case where the consensus estimation is perfect (T_{c}=\infty) and the observation is exact, i.e., W_{i}=\mathbb{E}[W_{i}]=J_{i} (T_{e}=\infty and \xi=0); then E_{i}=0 and \sigma_{0}=0. As a result, the upper bound of \mathbb{E}[\|\bar{g}_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}] is independent of \delta, whereas the upper bound of \mathbb{E}[\|g_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}] becomes n_{l}^{2}J_{0}^{2}d_{i}/\delta^{2}, as shown in Lemma 8. This implies that the variance of the two-point zeroth-order oracle is independent of the reward value and the maximum path length n_{0}, and is thus more scalable than the one-point feedback. Now we consider a more practical scenario where both the consensus estimates and the observations are inexact. For convenience of analysis, we regard \delta>0 as an infinitesimal quantity and neglect the terms in the upper bounds that are independent of the network scale. Then we have \mathbb{E}[\|\bar{g}_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}]=\mathcal{O}(n_{l}d_{i}/\delta^{2}). In Lemma 8, we showed that the variance bound for the zeroth-order oracle with one-point feedback is \mathcal{O}(n_{l}^{2}d_{i}/\delta^{2}). Therefore, when \delta>0 is small enough, the two-point zeroth-order oracle still outperforms the one-point feedback scheme in terms of lower variance and faster convergence.
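The following toy Monte-Carlo sketch (our illustration on a smooth, deterministic test function rather than the MARL setting of this paper; the function f and all constants are assumptions) makes the comparison tangible by printing the empirical second moments of the two oracles for a small smoothing radius \delta.

```python
import numpy as np

# One-point oracle:  g  = f(theta + delta*u) * u / delta
# Two-point oracle:  g2 = (f(theta + delta*u) - f(theta)) * u / delta
rng = np.random.default_rng(1)
d, delta, trials = 10, 0.05, 200_000
theta = np.ones(d)
f = lambda x: 0.5 * np.sum(x**2, axis=-1)  # stand-in for a deterministic return

u = rng.standard_normal((trials, d))
f_plus = f(theta + delta * u)
g_one = (f_plus / delta)[:, None] * u
g_two = ((f_plus - f(theta)) / delta)[:, None] * u

print("one-point E||g||^2 :", np.mean(np.sum(g_one**2, axis=1)))  # grows like 1/delta^2
print("two-point E||g||^2 :", np.mean(np.sum(g_two**2, axis=1)))  # stays O(d * ||grad f||^2)
```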

VII Simulation Results

In this section, we present numerical examples: Example 2 applies Algorithm 1 with the communication weight matrices (15) to the resource allocation problem in Example 1, Example 3 examines scalability on larger networks, and Example 4 applies Algorithm 1 with the communication weight matrices (27) to a large-scale network scenario.

Example 2

We employ the distributed RL with LVFs in (10) to solve the problem in Example 1. To seek the optimal policy πi(oi)\pi_{i}(o_{i}) for agent ii to determine its action {bij}j𝒩iS+\{b_{ij}\}_{j\in\mathcal{N}_{i}^{S+}}, we adopt the following parameterization for the policy function:

bij=exp(zij)jiexp(zij),b_{ij}=\frac{\exp(-z_{ij})}{\sum_{j\in\mathcal{I}_{i}}\exp(-z_{ij})}, (35)

where zijz_{ij} is approximated by radial basis functions:

zij=k=1ncoicik2θij(k),z_{ij}=\sum_{k=1}^{n_{c}}\|o_{i}-c_{ik}\|^{2}\theta_{ij}(k), (36)

c_{ik}=(\hat{c}_{ik}^{\top},\bar{c}_{ik})\in\mathbb{R}^{|\mathcal{I}_{i}^{O}|+1} is the center of the k-th feature for agent i, where \hat{c}_{ik}\in\mathbb{R}^{|\mathcal{I}_{i}^{O}|} and \bar{c}_{ik} are set according to the ranges of m_{\mathcal{I}_{i}} and d_{i}, respectively, such that the centers c_{ik}, k=1,...,n_{c}, are approximately evenly distributed over the range of o_{i}.
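A minimal sketch of the parameterization (35)-(36) is given below (ours; the array shapes, the numerical stabilization and all names are assumptions): it maps an observation o_i to allocation fractions b_{ij} through a softmax of the RBF-weighted scores z_{ij}.

```python
import numpy as np

def allocation_policy(o_i, centers, theta_i):
    """
    o_i      : observation of agent i,        shape (obs_dim,)
    centers  : RBF centers c_{ik},            shape (n_c, obs_dim)
    theta_i  : parameters theta_{ij}(k),      shape (n_dest, n_c)
    returns  : allocation fractions b_{ij},   shape (n_dest,), summing to 1
    """
    features = np.sum((o_i - centers) ** 2, axis=1)  # ||o_i - c_{ik}||^2 as in (36)
    z = theta_i @ features                           # z_{ij} = sum_k ||o_i - c_{ik}||^2 theta_{ij}(k)
    e = np.exp(-(z - z.min()))                       # softmax of -z as in (35), stabilized
    return e / e.sum()

# Hypothetical usage: 3 destinations, 4 RBF centers, a 5-dimensional observation.
rng = np.random.default_rng(2)
b = allocation_policy(rng.normal(size=5), rng.normal(size=(4, 5)), rng.normal(size=(3, 4)))
print(b, b.sum())  # nonnegative fractions summing to 1
```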

We set m_{i}(0)=1+\chi_{i} for all i=1,...,9, where \chi_{i} and w_{i} are random variables following the Gaussian distribution with mean 0 and variance 0.01, truncated to [-0.01,0.01]; the number of evolution steps is T_{e}=10, the number of learning epochs is K=1500, and y_{i}(t)-d_{i}(t)=0.5\sin t. The communication graph \mathcal{G}_{C}=(\mathcal{V},\mathcal{E}_{C}) is set as \mathcal{G}_{C}=\mathcal{G}_{SO}\cup\mathcal{G}_{SO}^{\top}, which satisfies Assumption 3. Let G_{C} be the 0-1 weighted adjacency matrix of graph \mathcal{G}_{C}, that is, G_{C}(i,j)=1 if (i,j)\in\mathcal{E}_{C} and G_{C}(i,j)=0 otherwise, and let d_{i}^{C}=\sum_{(j,i)\in\mathcal{E}_{C}}G_{C}(i,j). The communication weights are set as the Metropolis weights [45]:

Cijl={11+max{diC,djC},ifij,i,jlcl,(i,j)C;1jiCijl,ifi=j;0,otherwise,C^{l}_{ij}=\left\{\begin{aligned} &\frac{1}{1+\max\{d_{i}^{C},d_{j}^{C}\}},~{}~{}\text{if}~{}i\neq j,~{}i,j\in\mathcal{I}_{l}^{cl},(i,j)\in\mathcal{E}_{C};\\ &1-\sum_{j\neq i}C^{l}_{ij},~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{if}~{}i=j;\\ &0,~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}\text{otherwise},\end{aligned}\right.~{}~{}~{}~{} (37)

where l𝒞l\in\mathcal{C}.
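For concreteness, a minimal implementation of the Metropolis weights (37) might look as follows (our sketch under the stated assumptions; the graph data and function names are hypothetical). The resulting matrix is symmetric and its rows indexed by \mathcal{I}_{l}^{cl} sum to one, which yields the doubly stochastic weights required by the consensus step.

```python
import numpy as np

def metropolis_weights(n_agents, edges, members):
    """Metropolis weights (37): `edges` lists the undirected communication links
    of G_C, `members` is the agent index set I_l^cl of cluster l."""
    deg = {i: 0 for i in range(n_agents)}          # degrees d_i^C in the full graph G_C
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    C = np.zeros((n_agents, n_agents))
    for i, j in edges:
        if i in members and j in members:
            C[i, j] = C[j, i] = 1.0 / (1.0 + max(deg[i], deg[j]))
    for i in members:
        C[i, i] = 1.0 - C[i].sum()                 # diagonal completes each row to 1
    return C

# Hypothetical 4-agent path graph 0-1-2-3 with all agents in one cluster.
C = metropolis_weights(4, edges=[(0, 1), (1, 2), (2, 3)], members={0, 1, 2, 3})
print(np.allclose(C.sum(axis=0), 1), np.allclose(C.sum(axis=1), 1))  # True True
```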

By further setting the consensus iteration number T_{c}=10, \eta=0.01, and \delta=2, Fig. 5 (left) depicts the evolution of the observed GVF values under 4 different RL algorithms. The two boundaries of the shaded area are obtained by running each RL algorithm 10 times and taking the maximum and minimum of W(\theta^{k},\xi^{*}) in each learning episode. Here \xi^{*} denotes the specific noise generated in the simulation, which differs across learning processes. In each run, one perturbation vector u^{k} is sampled and used for all 4 algorithms during each learning episode k. The centralized algorithm is the zeroth-order optimization algorithm based on global value evaluation, while the distributed algorithm is based on local value evaluation (Algorithm 1). The distributed two-point feedback algorithm is Algorithm 1 with g_{i}(\theta^{k},u^{k},\xi^{k}) replaced by \bar{g}_{i}(\theta^{k},u^{k},\xi^{k}) in (33). We observe that the distributed algorithms are always faster than the centralized algorithms. Fig. 5 (middle) and Fig. 5 (right) compare the centralized and distributed one-point feedback algorithms, and the centralized and distributed two-point feedback algorithms, respectively. From these two figures, it is clear that the distributed algorithms always exhibit lower variance than the centralized algorithms. This implies that policy evaluation based on LVFs is more robust than that based on the GVF.

Refer to caption
Figure 5: (Left) Comparison of different algorithms for 9 warehouses; (Middle) centralized and distributed algorithms under zeroth-order oracles with one-point feedback; (Right) centralized and distributed algorithms under zeroth-order oracles with two-point feedback. The observed GVF value W(θk,ξ)=i𝒱Wi(θk,ξi)W(\theta^{k},\xi^{*})=\sum_{i\in\mathcal{V}}W_{i}(\theta^{k},\xi_{i}^{*}) is employed as the performance metric. The boundaries of the shaded area are obtained by running each RL algorithm for 10 times and taking the upper bound and lower bound of W(θk,ξ)W(\theta^{k},\xi^{*}) in each learning episode.
Example 3

Next, in a setting similar to Example 1, we consider an extendable example with NN warehouses. By regarding 1 and N+1N+1 as the same warehouse, we set

S=O={(i,j)𝒱2:|ij|=1,i=2k1,k}.\mathcal{E}_{S}=\mathcal{E}_{O}=\{(i,j)\in\mathcal{V}^{2}:|i-j|=1,i=2k-1,k\in\mathbb{N}\}.

The reward graph is set as 𝒢R=𝒢S\mathcal{G}_{R}=\mathcal{G}_{S}^{\top}. According to the definitions introduced in Subsection III-A, we have iL={i,i+1,i+2,i1,i2}\mathcal{I}_{i}^{L}=\{i,i+1,i+2,i-1,i-2\} if ii is odd, iL={i,i+1,i1}\mathcal{I}_{i}^{L}=\{i,i+1,i-1\} if ii is even, for all i𝒱i\in\mathcal{V}.

The communication graph \mathcal{G}_{C}=(\mathcal{V},\mathcal{E}_{C}) is set as \mathcal{G}_{C}=\mathcal{G}_{L}, implying that each agent can estimate its LVF value without using the local consensus algorithm. The learning step size is set as \eta=0.05. Other parameter settings are the same as those in Example 2. By implementing four different RL algorithms, Fig. 6 shows the results for N=20, N=40, and N=80, respectively. Observe that the convergence time of the distributed algorithms remains almost invariant across networks of different scales, whereas the centralized algorithms converge much more slowly as the network scale increases. Moreover, the two-point oracle always outperforms the one-point oracle in terms of lower variance and faster convergence. These observations are consistent with our analysis in Remark 2 and Section VI.

Refer to caption
Figure 6: Comparison of different RL algorithms for Example 3 with N=20N=20, N=40N=40, and N=80N=80, respectively.
Refer to caption
Figure 7: Comparison of Algorithm 1 based on TLVFs with κ=0\kappa=0, 1, and 2, respectively, for Example 4.
Example 4

Now we consider N=100N=100 warehouses with connected undirected state graph and observation graph,

S=O={(i,j)𝒱2:|ij|=1},\mathcal{E}_{S}=\mathcal{E}_{O}=\{(i,j)\in\mathcal{V}^{2}:|i-j|=1\},

where warehouse 1 is also viewed as warehouse N+1N+1.

The edge set of graph \mathcal{G}_{R}=(\mathcal{V},\mathcal{E}_{R}) is set as \mathcal{E}_{R}=\varnothing, implying that the individual reward of each agent only depends on its own state and action. In this case, the learning graph \mathcal{G}_{L} is complete because i\stackrel{\mathcal{E}_{SO}}{\longrightarrow}j for any i,j\in\mathcal{V}. Then Algorithm 1 with the LVF setting in (10) becomes a centralized algorithm. Hence, we employ the TLVF defined in (25). The communication graph is set as \mathcal{G}_{C}=\mathcal{G}_{S}. By setting \mathcal{V}_{l}=\{4l-3,...,4l\}, l=1,...,25, and choosing the same parameters m_{i}(0), T_{e}, z_{i}(t) as in Example 2, the simulation results for \kappa=0, 1, and 2 are shown in Fig. 7. We observe that the distributed RL algorithm with \kappa=0 achieves the lowest variance and the fastest convergence rate. This means that, in this example, the TLVF approximation error does not harm the improved performance of our RL algorithm. Moreover, the smaller \kappa is, the faster the algorithm converges, which is consistent with the analysis in Remark 6.

VIII Conclusions

We have recognized three graphs inherently embedded in MARL, namely, the state graph, observation graph, and reward graph. A connection between these three graphs and the learning graph was established, based on which we proposed our distributed RL algorithm via LVFs and derived conditions on the communication graph required in RL. It was shown that the LVFs constructed based on the aforementioned 3 graphs are able to play the same role as the GVF in gradient estimation. To adapt our algorithm to MARL with general graphs, we have designed TLVFs associated with an artificially specified index. The choice of this index is a trade-off between variance reduction and gradient approximation errors. Simulation examples have shown that our proposed algorithms with LVFs or TLVFs significantly outperform RL algorithms based on GVF, especially for large-scale MARL problems.

The RL algorithms proposed in this work are policy gradient algorithms based on ZOO, which are general but may not be the best choice for specific applications. In the future, we are interested in exploring how our graph-theoretic approach can be combined with other RL techniques to facilitate learning for large-scale network systems.

IX Appendix A: Proofs of Lemmas and Theorems

Proof of Lemma 1. (i). Suppose that iL=𝒱\mathcal{I}_{i}^{L}=\mathcal{V} for all i𝒱i\in\mathcal{V}. Since 𝒢SOR\mathcal{G}_{SOR} has n>1n>1 SCCs, there must exist distinct i,j𝒱i,j\in\mathcal{V} such that jj is not reachable from ii in 𝒢SOR\mathcal{G}_{SOR}. However, iL=𝒱\mathcal{I}_{i}^{L}=\mathcal{V} implies that there exists k𝒱k\in\mathcal{V} such that jkR+j\in\mathcal{I}_{k}^{R+}, and kiSOk\in\mathcal{R}_{i}^{SO}, implying that jj is reachable from ii in 𝒢SO𝒢R\mathcal{G}_{SO}\cup\mathcal{G}_{R}, which is a contradiction.

(ii). Suppose 𝒢SO\mathcal{G}_{SO} only has 1 SCC. According to (6), iSO=𝒱\mathcal{R}_{i}^{SO}=\mathcal{V} for any i𝒱i\in\mathcal{V}. This implies that iL=𝒱\mathcal{I}_{i}^{L}=\mathcal{V} for any i𝒱i\in\mathcal{V}, which contradicts with Assumption 1. \blacksquare

Proof of Lemma 2. Define

J¯i(θ)=J(θ)J^i(θ)=j𝒱iL𝔼s0𝒟Vjπ(θ)(s0).\bar{J}_{i}(\theta)=J(\theta)-\hat{J}_{i}(\theta)=\sum_{j\in\mathcal{V}\setminus\mathcal{I}_{i}^{L}}\mathbb{E}_{s_{0}\sim\mathcal{D}}V_{j}^{\pi(\theta)}(s_{0}). (38)

Next we show that J¯i(θ)\bar{J}_{i}(\theta) is independent of θi\theta_{i}.

Let iSO={j𝒱:jSOi}{i}\mathcal{R}^{SO-}_{i}=\{j\in\mathcal{V}:j\stackrel{{\scriptstyle\mathcal{E}_{SO}}}{{\longrightarrow}}i\}\cup\{i\} be the set of vertices in graph 𝒢SO\mathcal{G}_{SO} that can reach ii and vertex ii. Note that for each agent jj, its action at time tt, i.e., aj(t)a_{j}(t), is only affected by the partial observation oj(t)o_{j}(t), the current state sj(t)s_{j}(t), and policy θj\theta_{j}. Therefore, there exists a function fj:𝒪j×djP(𝒜j)f_{j}:\mathcal{O}_{j}\times\mathbb{R}^{d_{j}}\rightarrow P(\mathcal{A}_{j}) such that

aj(t)fj(oj(t),θj)=fj({sk(t)}kjO,θj).\begin{split}a_{j}(t)\sim f_{j}(o_{j}(t),\theta_{j})=f_{j}(\{s_{k}(t)\}_{k\in\mathcal{I}_{j}^{O}},\theta_{j}).\end{split} (39)

Similarly, according to the definition of 𝒫i\mathcal{P}_{i}, there exists another function hj:ΠkjS𝒮k×ΠkjS𝒜kP(𝒮j)h_{j}:\Pi_{k\in\mathcal{I}_{j}^{S}}\mathcal{S}_{k}\times\Pi_{k\in\mathcal{I}_{j}^{S}}\mathcal{A}_{k}\rightarrow P(\mathcal{S}_{j}) such that

sj(t)hj({sk(t1)}kjS,{ak(t1)}kjS),s_{j}(t)\sim h_{j}(\{s_{k}(t-1)\}_{k\in\mathcal{I}_{j}^{S}},\{a_{k}(t-1)\}_{k\in\mathcal{I}_{j}^{S}}), (40)

together with (39), we have

sj(t)hj({sk(t1)}kjSO,{θl}ljS),s_{j}(t)\sim h_{j}(\{s_{k}(t-1)\}_{k\in\mathcal{I}_{j}^{SO}},\{\theta_{l}\}_{l\in\mathcal{I}_{j}^{S}}), (41)

and

aj(t)fj({hk({sl1(t1)}l1kSO,{θl2}l2kS)}kjO,θj),\begin{split}a_{j}(t)\sim f_{j}(\{h_{k}(\{s_{l_{1}}(t-1)\}_{l_{1}\in\mathcal{I}_{k}^{SO}},\{\theta_{l_{2}}\}_{l_{2}\in\mathcal{I}_{k}^{S}})\}_{k\in\mathcal{I}_{j}^{O}},\theta_{j}),\end{split} (42)

where jSO=jSjO\mathcal{I}_{j}^{SO}=\mathcal{I}_{j}^{S}\cup\mathcal{I}_{j}^{O}.

According to (41) and (42), we conclude that {sj(t),aj(t)}\{s_{j}(t),a_{j}(t)\} is affected by θl\theta_{l} only if ljSOl\in\mathcal{R}_{j}^{SO-}. As a result, {sk(t),ak(t)}kjR\{s_{k}(t),a_{k}(t)\}_{k\in\mathcal{I}_{j}^{R}} is affected by θl\theta_{l} only if lkjRkSOAjl\in\cup_{k\in\mathcal{I}_{j}^{R}}\mathcal{R}_{k}^{SO-}\triangleq A_{j}.

Next we show once jiLj\notin\mathcal{I}_{i}^{L}, it must hold that iAji\notin A_{j}, i.e., θi\theta_{i} will not affect {sk(t),ak(t)}kjR\{s_{k}(t),a_{k}(t)\}_{k\in\mathcal{I}_{j}^{R}}. By the definition in (8), jiLj\notin\mathcal{I}_{i}^{L} implies that jRiSO=\mathcal{I}_{j}^{R}\cap\mathcal{R}_{i}^{SO}=\varnothing. That is, there are no vertices in jR\mathcal{I}_{j}^{R} that are reachable from vertex ii in graph 𝒢SO\mathcal{G}_{SO}. As a result, iAji\notin A_{j}.

Then we conclude that \theta_{i} never influences r_{j}(s_{\mathcal{I}_{j}^{R}}(t),a_{\mathcal{I}_{j}^{R}}(t)) for any t\geq 0 if j\notin\mathcal{I}_{i}^{L}. Therefore, \bar{J}_{i}(\theta) is independent of \theta_{i}.

Proof for (i): it has been shown in [36] that

θJδ(θ)=1δφdJ(θ+δu)e12u2u𝑑u,\nabla_{\theta}J^{\delta}(\theta)=\frac{1}{\delta\varphi}\int_{\mathbb{R}^{d}}J(\theta+\delta u)e^{-\frac{1}{2}\|u\|^{2}}udu, (43)

where φ=de12u2𝑑u\varphi=\int_{\mathbb{R}^{d}}e^{-\frac{1}{2}\|u\|^{2}}du. Define

Φi=(𝟎di×d1,,Idi,,𝟎di×dN)di×d,\Phi^{i}=(\mathbf{0}_{d_{i}\times d_{1}},...,I_{d_{i}},...,\mathbf{0}_{d_{i}\times d_{N}})\in\mathbb{R}^{d_{i}\times d}, (44)

then ui=Φiuu_{i}=\Phi^{i}u. It follows that

θiJδ(θ)\displaystyle\nabla_{\theta_{i}}J^{\delta}(\theta) =ΦiθJδ(θ)\displaystyle=\Phi^{i}\nabla_{\theta}J^{\delta}(\theta)
=1δφdJ(θ+δu)e12u2ui𝑑u\displaystyle=\frac{1}{\delta\varphi}\int_{\mathbb{R}^{d}}J(\theta+\delta u)e^{-\frac{1}{2}\|u\|^{2}}u_{i}du
=1δφdJ^i(θ+δu)e12u2ui𝑑u\displaystyle=\frac{1}{\delta\varphi}\int_{\mathbb{R}^{d}}\hat{J}_{i}(\theta+\delta u)e^{-\frac{1}{2}\|u\|^{2}}u_{i}du (45)
+1δφdJ¯i(θ+δu)e12u2ui𝑑u.\displaystyle~{}~{}+\frac{1}{\delta\varphi}\int_{\mathbb{R}^{d}}\bar{J}_{i}(\theta+\delta u)e^{-\frac{1}{2}\|u\|^{2}}u_{i}du.

Let θ¯i=(,θj,)jiddi\bar{\theta}_{i}=(...,\theta_{j}^{\top},...)^{\top}_{j\neq i}\in\mathbb{R}^{d-d_{i}}, u¯i=(,uj,)jiddi\bar{u}_{i}=(...,u_{j}^{\top},...)^{\top}_{j\neq i}\in\mathbb{R}^{d-d_{i}}. Since we have proved that J¯i(θ+δu)\bar{J}_{i}(\theta+\delta u) is independent of uiu_{i}, the following holds:

dJ¯i(θ+δu)e12u2ui𝑑u=ddiJ¯i(θ¯i+δu¯i)e12u¯i2𝑑u¯idie12ui2ui𝑑ui=0.\begin{split}&\int_{\mathbb{R}^{d}}\bar{J}_{i}(\theta+\delta u)e^{-\frac{1}{2}\|u\|^{2}}u_{i}du\\ &=\int_{\mathbb{R}^{d-d_{i}}}\bar{J}_{i}(\bar{\theta}_{i}+\delta\bar{u}_{i})e^{-\frac{1}{2}\|\bar{u}_{i}\|^{2}}d\bar{u}_{i}\int_{\mathbb{R}^{d_{i}}}e^{-\frac{1}{2}\|u_{i}\|^{2}}u_{i}du_{i}=0.\end{split}

Therefore,

θiJδ(θ)=1δφdJ^i(θ+δu)e12u2ui𝑑u=θiJ^iδ(θ).\nabla_{\theta_{i}}J^{\delta}(\theta)=\frac{1}{\delta\varphi}\int_{\mathbb{R}^{d}}\hat{J}_{i}(\theta+\delta u)e^{-\frac{1}{2}\|u\|^{2}}u_{i}du=\nabla_{\theta_{i}}\hat{J}_{i}^{\delta}(\theta). (46)

Proof for (ii): differentiability of Ji(θ)J_{i}(\theta) for all i𝒱i\in\mathcal{V} implies that J^i(θ)\hat{J}_{i}(\theta) for all i𝒱i\in\mathcal{V} and J(θ)J(\theta) are differentiable as well. Since J¯i(θ)\bar{J}_{i}(\theta) is independent of θi\theta_{i}, we have

θiJ(θ)=θi(J^i(θ)+J¯i(θ))=θiJ^i(θ).\nabla_{\theta_{i}}J(\theta)=\nabla_{\theta_{i}}(\hat{J}_{i}(\theta)+\bar{J}_{i}(\theta))=\nabla_{\theta_{i}}\hat{J}_{i}(\theta). (47)

This completes the proof. \blacksquare

Proof of Lemma 5. Given any policy π(θ)\pi(\theta), it holds that

t=0γtri(si(t),ai(t))t=0γtru=11γruJu.\sum_{t=0}^{\infty}\gamma^{t}r_{i}(s_{i}(t),a_{i}(t))\leq\sum_{t=0}^{\infty}\gamma^{t}r_{u}=\frac{1}{1-\gamma}r_{u}\triangleq J_{u}. (48)

Similarly, it can be shown that Jl=11γrlJ_{l}=\frac{1}{1-\gamma}r_{l}. \blacksquare

Proof of Lemma 6. From the definition of Ji(θ)J_{i}(\theta) and Wi(θ,ξi)W_{i}(\theta,\xi_{i}), we have

|Ji(θ)𝔼[Wi(θ,ξ)]|=|𝔼[t=Teγtri(si(t),ai(t))]|t=Teγtr0=γTe1γr0.\begin{split}\left|J_{i}(\theta)-\mathbb{E}[W_{i}(\theta,\xi)]\right|&=\left|\mathbb{E}\left[\sum_{t=T_{e}}^{\infty}\gamma^{t}r_{i}(s_{i}(t),a_{i}(t))\right]\right|\\ &\leq\sum_{t=T_{e}}^{\infty}\gamma^{t}r_{0}=\frac{\gamma^{T_{e}}}{1-\gamma}r_{0}.\end{split} (49)

Then we have

|J^i(θ)𝔼[W^i(θ,ξ)]|jiL|Jj(θ)𝔼[Wj(θ,ξ)]|nlγTe1γr0.|\hat{J}_{i}(\theta)-\mathbb{E}[\hat{W}_{i}(\theta,\xi)]|\leq\sum_{j\in\mathcal{I}_{i}^{L}}\left|J_{j}(\theta)-\mathbb{E}[W_{j}(\theta,\xi)]\right|\leq\frac{n_{l}\gamma^{T_{e}}}{1-\gamma}r_{0}.

The proof is completed. \blacksquare

Proof of Lemma 7. Note that the following holds for any step vv:

jiLμjkl(v+1)=𝟏nlμkl(v+1)=𝟏nlC0lμkl(v)=𝟏nlμkl(v),\sum_{j\in\mathcal{I}_{i}^{L}}\mu_{j}^{kl}(v+1)=\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(v+1)=\mathbf{1}_{n_{l}}^{\top}C^{l}_{0}\mu^{kl}(v)=\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(v), (50)

where the last equality holds because C0lC_{0}^{l} is doubly stochastic. It follows that 𝟏nlμkl(v)=W^i(θk+δuk,ξik)\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(v)=\hat{W}_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k}) for any v[0,Tc]v\in[0,T_{c}].

Next we evaluate the estimation error. The following holds:

nlμkl(v)𝟏nlW^i(θk+δuk,ξik)=nlμkl(v)𝟏nl𝟏nlμkl(v)=nlC0lμkl(v1)𝟏nl𝟏nlμkl(v1)=(C0l1nl𝟏nl𝟏nl)(nlμkl(v1)𝟏nl𝟏nlμkl(v1))=(C0l1nl𝟏nl𝟏nl)v(nlμkl(0)𝟏nl𝟏nlμkl(0))=(C0l1nl𝟏nl𝟏nl)v(nlμkl(0)𝟏nlW^i(θk+δuk,ξik)),\begin{split}&n_{l}\mu^{kl}(v)-\mathbf{1}_{n_{l}}\hat{W}_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})\\ &=n_{l}\mu^{kl}(v)-\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(v)\\ &=n_{l}C^{l}_{0}\mu^{kl}(v-1)-\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(v-1)\\ &=(C^{l}_{0}-\frac{1}{n_{l}}\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top})(n_{l}\mu^{kl}(v-1)-\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(v-1))\\ &=(C^{l}_{0}-\frac{1}{n_{l}}\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top})^{v}(n_{l}\mu^{kl}(0)-\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(0))\\ &=(C^{l}_{0}-\frac{1}{n_{l}}\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top})^{v}(n_{l}\mu^{kl}(0)-\mathbf{1}_{n_{l}}\hat{W}_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})),\end{split} (51)

where the second equality used (50) and the third equality holds because (C0l1nl𝟏nl𝟏nl)𝟏nl𝟏nlμkl(t)=0(C^{l}_{0}-\frac{1}{n_{l}}\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top})\mathbf{1}_{n_{l}}\mathbf{1}_{n_{l}}^{\top}\mu^{kl}(t)=0.

As a result, for any v[0,Tc]v\in[0,T_{c}], we have

𝔼[nlμkl(v)]𝟏nlJ^i(θk+δuk)\displaystyle\left\|\mathbb{E}\left[n_{l}\mu^{kl}(v)\right]-\mathbf{1}_{n_{l}}\hat{J}_{i}(\theta^{k}+\delta u^{k})\right\|
𝔼[nlμkl(v)𝟏nlW^i(θk+δuk,ξik)]\displaystyle\leq\left\|\mathbb{E}\left[n_{l}\mu^{kl}(v)-\mathbf{1}_{n_{l}}\hat{W}_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})\right]\right\|
+𝟏nlJ^i(θk+δuk)𝔼[𝟏nlW^i(θk+δuk,ξik)]\displaystyle+\left\|\mathbf{1}_{n_{l}}\hat{J}_{i}(\theta^{k}+\delta u^{k})-\mathbb{E}\left[\mathbf{1}_{n_{l}}\hat{W}_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})\right]\right\|
ρlv𝔼[nlμkl(0)𝟏nlW^i(θk+δuk,ξik)]+nl2γTeJ0\displaystyle\leq\rho_{l}^{v}\left\|\mathbb{E}\left[n_{l}\mu^{kl}(0)-\mathbf{1}_{n_{l}}\hat{W}_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})\right]\right\|+n_{l}^{2}\gamma^{T_{e}}J_{0}
ρlv𝔼[nlμkl(0)𝟏nlJ^i(θk+δuk)]+(ρlv+1)nl2γTeJ0\displaystyle\leq\rho_{l}^{v}\left\|\mathbb{E}\left[n_{l}\mu^{kl}(0)-\mathbf{1}_{n_{l}}\hat{J}_{i}(\theta^{k}+\delta u^{k})\right]\right\|+(\rho_{l}^{v}+1)n_{l}^{2}\gamma^{T_{e}}J_{0}
ρlvnl2(JuJl)+(ρlv+1)nl2γTeJ0\displaystyle\leq\rho_{l}^{v}n_{l}^{2}(J_{u}-J_{l})+(\rho_{l}^{v}+1)n_{l}^{2}\gamma^{T_{e}}J_{0}
=ρlvnl2(JuJl+γTeJ0)+nl2γTeJ0,\displaystyle=\rho_{l}^{v}n_{l}^{2}\left(J_{u}-J_{l}+\gamma^{T_{e}}J_{0}\right)+n_{l}^{2}\gamma^{T_{e}}J_{0}, (52)

where the second inequality used (51) and Lemma 6, the third inequality used Lemma 6 again, and the last inequality used the uniform bound of JiJ_{i} and the fact that 𝔼[μikl(0)]=𝔼[Wi(θk+δuk,ξik)]Jiδ(θk)\mathbb{E}[\mu^{kl}_{i}(0)]=\mathbb{E}[W_{i}(\theta^{k}+\delta u^{k},\xi_{i}^{k})]\leq J_{i}^{\delta}(\theta^{k}).

Due to Assumption 4, we have minjiLWj(θk+δuk,ξik)μikl(Tc)maxjiLWj(θk+δuk,ξik)\min_{j\in\mathcal{I}_{i}^{L}}W_{j}(\theta^{k}+\delta u^{k},\xi_{i}^{k})\leq\mu_{i}^{kl}(T_{c})\leq\max_{j\in\mathcal{I}_{i}^{L}}W_{j}(\theta^{k}+\delta u^{k},\xi_{i}^{k}). Let i0=argmaxjiL|Wj(θk+δuk,ξik)|i_{0}=\arg\max_{j\in\mathcal{I}_{i}^{L}}|W_{j}(\theta^{k}+\delta u^{k},\xi_{i}^{k})|. Then we have

𝔼ξk[[nlμikl(Tc)]2]nl2𝔼ξk[Wi02(θk+δuk,ξk)]\displaystyle\mathbb{E}_{\xi^{k}\sim\mathcal{H}}\left[\left[n_{l}\mu_{i}^{kl}(T_{c})\right]^{2}\right]\leq n_{l}^{2}\mathbb{E}_{\xi^{k}\sim\mathcal{H}}\left[W_{i_{0}}^{2}(\theta^{k}+\delta u^{k},\xi^{k})\right]
=nl2𝔼ξk[(ξi0k)2]+nl2(𝔼ξk[Wi0(θk+δuk,ξk)])2\displaystyle=n_{l}^{2}\mathbb{E}_{\xi^{k}\sim\mathcal{H}}[(\xi^{k}_{i_{0}})^{2}]+n_{l}^{2}(\mathbb{E}_{\xi^{k}\sim\mathcal{H}}[W_{i_{0}}(\theta^{k}+\delta u^{k},\xi^{k})])^{2}
nl2σ02+nl2(Ji0(θk+δuk)+γTeJ0)2\displaystyle\leq n_{l}^{2}\sigma_{0}^{2}+n_{l}^{2}(J_{i_{0}}(\theta^{k}+\delta u^{k})+\gamma^{T_{e}}J_{0})^{2}
nl2(σ02+(1+γTe)2J02).\displaystyle\leq n_{l}^{2}(\sigma_{0}^{2}+(1+\gamma^{T_{e}})^{2}J_{0}^{2}). (53)

The proof is completed. \blacksquare

Proof of Lemma 8. According to (17), we have

𝔼[gi(θk,\displaystyle\mathbb{E}[\|g_{i}(\theta^{k}, uk,ξk)2]=1δ2𝔼uik[𝔼ξk[(nlμikl(Tc))2]uik2]\displaystyle u^{k},\xi^{k})\|^{2}]=\frac{1}{\delta^{2}}\mathbb{E}_{u_{i}^{k}}\left[\mathbb{E}_{\xi^{k}\sim\mathcal{H}}\left[\left(n_{l}\mu_{i}^{kl}(T_{c})\right)^{2}\right]\|u_{i}^{k}\|^{2}\right]
Blμδ2𝔼[uik2]\displaystyle\leq\frac{B^{\mu}_{l}}{\delta^{2}}\mathbb{E}\left[\|u^{k}_{i}\|^{2}\right]
=Blμδ2φdiuik2e12uik2𝑑uikddie12v2𝑑v\displaystyle=\frac{B^{\mu}_{l}}{\delta^{2}\varphi}\int_{\mathbb{R}^{d_{i}}}\|u_{i}^{k}\|^{2}e^{-\frac{1}{2}\|u_{i}^{k}\|^{2}}du_{i}^{k}\int_{\mathbb{R}^{d-d_{i}}}e^{-\frac{1}{2}\|v\|^{2}}dv
Blμδ2φdidie12uik2𝑑uikddie12v2𝑑v\displaystyle\leq\frac{B^{\mu}_{l}}{\delta^{2}\varphi}d_{i}\int_{\mathbb{R}^{d_{i}}}e^{-\frac{1}{2}\|u_{i}^{k}\|^{2}}du_{i}^{k}\int_{\mathbb{R}^{d-d_{i}}}e^{-\frac{1}{2}\|v\|^{2}}dv
=Blμdiδ2,\displaystyle=\frac{B^{\mu}_{l}d_{i}}{\delta^{2}}, (54)

where φ\varphi is defined in (43), the first inequality used (21), and the second inequality holds because diuik2e12uik2𝑑uikdidie12uik2𝑑uik\int_{\mathbb{R}^{d_{i}}}\|u_{i}^{k}\|^{2}e^{-\frac{1}{2}\|u_{i}^{k}\|^{2}}du_{i}^{k}\leq d_{i}\int_{\mathbb{R}^{d_{i}}}e^{-\frac{1}{2}\|u_{i}^{k}\|^{2}}du_{i}^{k}, which has been proved in [36, Lemma 1]. \blacksquare

Proof of Theorem 1. Statement (i) can be obtained by using the Lipschitz continuity of J(θ)J(\theta). The details have been shown in [36, Theorem 1].

Now we prove statement (ii). According to Assumption 2 and [36, Lemma 2], the gradient of Jδ(θ)J^{\delta}(\theta) is dL/r\sqrt{d}L/r-Lipschitz continuous. Let g(θk)=(g1(θk),,gN(θk))dg(\theta^{k})=(g_{1}^{\top}(\theta^{k}),...,g_{N}^{\top}(\theta^{k}))^{\top}\in\mathbb{R}^{d}, the following holds:

|Jδ(θk+1)Jδ(θk)θJδ(θk),ηg(θk)|dL2rη2g(θk,uk,ξk)2,|J^{\delta}(\theta^{k+1})-J^{\delta}(\theta^{k})-\langle\nabla_{\theta}J^{\delta}(\theta^{k}),\eta g(\theta^{k})\rangle|\\ \leq\frac{\sqrt{d}L}{2r}\eta^{2}\|g(\theta^{k},u^{k},\xi^{k})\|^{2}, (55)

which implies that

θJδ(θk),ηg(θk)Jδ(θk+1)Jδ(θk)+dL2rη2g(θk,uk,ξk)2.\langle\nabla_{\theta}J^{\delta}(\theta^{k}),\eta g(\theta^{k})\rangle\\ \leq J^{\delta}(\theta^{k+1})-J^{\delta}(\theta^{k})+\frac{\sqrt{d}L}{2r}\eta^{2}\|g(\theta^{k},u^{k},\xi^{k})\|^{2}. (56)

Lemma 8 implies that

𝔼[g(θk,uk,ξk)2]=i=1N𝔼[gi(θk,uk,ξk)2]i=1NBlμdi/δ2B0μd/δ2,\begin{split}\mathbb{E}[\|g(\theta^{k},u^{k},\xi^{k})\|^{2}]&=\sum_{i=1}^{N}\mathbb{E}[\|g_{i}(\theta^{k},u^{k},\xi^{k})\|^{2}]\\ &\leq\sum_{i=1}^{N}B^{\mu}_{l}d_{i}/\delta^{2}\leq B^{\mu}_{0}d/\delta^{2},\end{split} (57)

where B0μ=maxl𝒞Blμ=n02(σ02+(1+γTe)2J02)B_{0}^{\mu}=\max_{l\in\mathcal{C}}B^{\mu}_{l}=n_{0}^{2}(\sigma_{0}^{2}+(1+\gamma^{T_{e}})^{2}J_{0}^{2}).

Moreover, Lemma 7 implies that

𝔼[θiJδ(θk),gi(θk,uk,ξk)]\displaystyle\mathbb{E}\left[\langle\nabla_{\theta_{i}}J^{\delta}(\theta^{k}),g_{i}(\theta^{k},u^{k},\xi^{k})\rangle\right]
=𝔼[θiJδ(θk)2]\displaystyle=\mathbb{E}[\|\nabla_{\theta_{i}}J^{\delta}(\theta^{k})\|^{2}]
+𝔼[θiJδ(θk),nlμikl(Tc)uik/δJ^i(θk+δuk)uik/δ]\displaystyle~{}~{}~{}+\mathbb{E}[\langle\nabla_{\theta_{i}}J^{\delta}(\theta^{k}),n_{l}\mu_{i}^{kl}(T_{c})u^{k}_{i}/\delta-\hat{J}_{i}(\theta^{k}+\delta u^{k})u^{k}_{i}/\delta\rangle]
𝔼[θiJ(θk)2]12𝔼[θiJ(θk)2+Ei2uik2/δ2]\displaystyle\geq\mathbb{E}[\|\nabla_{\theta_{i}}J(\theta^{k})\|^{2}]-\frac{1}{2}\mathbb{E}\left[\|\nabla_{\theta_{i}}J(\theta^{k})\|^{2}+E_{i}^{2}\|u_{i}^{k}\|^{2}/\delta^{2}\right]
=12𝔼[θiJ(θk)2]Ei2di2δ2,\displaystyle=\frac{1}{2}\mathbb{E}[\|\nabla_{\theta_{i}}J(\theta^{k})\|^{2}]-\frac{E_{i}^{2}d_{i}}{2\delta^{2}}, (58)

where the first equality used Lemma 2 and \nabla_{\theta_{i}}\hat{J}^{\delta}(\theta^{k})=\mathbb{E}[\hat{J}_{i}(\theta^{k}+\delta u^{k})u^{k}_{i}/\delta], and the inequality used (20). Summing (58) over i from 1 to N yields

𝔼[θJδ(θk),g(θk,uk,ξk)]12𝔼[θJ(θk)2]E02d2δ2,\mathbb{E}[\langle\nabla_{\theta}J^{\delta}(\theta^{k}),g(\theta^{k},u^{k},\xi^{k})\rangle]\geq\frac{1}{2}\mathbb{E}[\|\nabla_{\theta}J(\theta^{k})\|^{2}]-\frac{E_{0}^{2}d}{2\delta^{2}}, (59)

where E0=maxi𝒱EiE_{0}=\max_{i\in\mathcal{V}}E_{i}.

Combining (59) and (56), and taking expectation on both sides, we obtain

12η𝔼[θJδ(θk)2]ηE02d2δ2\displaystyle\frac{1}{2}\eta\mathbb{E}[\|\nabla_{\theta}J^{\delta}(\theta^{k})\|^{2}]-\eta\frac{E_{0}^{2}d}{2\delta^{2}}
𝔼[Jδ(θk+1)Jδ(θk)]+dL2δη2𝔼[g(θk,uk,ξk)2]\displaystyle\leq\mathbb{E}[J^{\delta}(\theta^{k+1})-J^{\delta}(\theta^{k})]+\frac{\sqrt{d}L}{2\delta}\eta^{2}\mathbb{E}[\|g(\theta^{k},u^{k},\xi^{k})\|^{2}]
𝔼[Jδ(θk+1)Jδ(θk)]+LB0μd1.52δ3η2,\displaystyle\leq\mathbb{E}[J^{\delta}(\theta^{k+1})-J^{\delta}(\theta^{k})]+\frac{LB^{\mu}_{0}d^{1.5}}{2\delta^{3}}\eta^{2}, (60)

where the second inequality employed (57).

Note that under the conditions on T_{e} and T_{c}, we have

E0ϵ1.52Ld.E_{0}\leq\frac{\epsilon^{1.5}}{\sqrt{2}Ld}. (61)

Summing (60) over k from 0 to K-1 and dividing both sides by K yields

1Kk=0K1𝔼[θJδ(θk)2]\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla_{\theta}J^{\delta}(\theta^{k})\|^{2}]
2η[1K(𝔼[Jδ(θK)]Jδ(θ0))+LB0μd1.52δ3η2]+E02dδ2\displaystyle\leq\frac{2}{\eta}\left[\frac{1}{K}\left(\mathbb{E}[J^{\delta}(\theta^{K})]-J^{\delta}(\theta^{0})\right)+\frac{LB^{\mu}_{0}d^{1.5}}{2\delta^{3}}\eta^{2}\right]+\frac{E_{0}^{2}d}{\delta^{2}}
2η[1K(NJuJδ(θ0))+B0μLd1.52δ3η2]+ϵ2\displaystyle\leq\frac{2}{\eta}\left[\frac{1}{K}(NJ_{u}-J^{\delta}(\theta^{0}))+B_{0}^{\mu}\frac{Ld^{1.5}}{2\delta^{3}}\eta^{2}\right]+\frac{\epsilon}{2}
d1.5ϵ1.5K[2(NJuJδ(θ0))+L4B0μ]+ϵ2,\displaystyle\leq\frac{d^{1.5}}{\epsilon^{1.5}\sqrt{K}}\left[2(NJ_{u}-J^{\delta}(\theta^{0}))+L^{4}B_{0}^{\mu}\right]+\frac{\epsilon}{2}, (62)

where the second inequality used (61), the last inequality used the conditions on δ\delta and η\eta. The proof is completed. \blacksquare

Proof of Lemma 9. When κDl\kappa\geq D^{*}_{l}, (25) implies J~iδ(θ)=J^iδ(θ)\tilde{J}_{i}^{\delta}(\theta)=\hat{J}_{i}^{\delta}(\theta). By Lemma 2, θiJ~iδ(θ)θiJδ(θ)=0\|\nabla_{\theta_{i}}\tilde{J}_{i}^{\delta}(\theta)-\nabla_{\theta_{i}}J^{\delta}(\theta)\|=0. Next we analyze the other case.

Let riθ(si(t),ai(t))r_{i}^{\theta}(s_{i}(t),a_{i}(t)) be the individual reward of agent ii at time tt under the global policy π(θ)\pi(\theta). Due to Lemma 2, it suffices to analyze θiJ~iδθiJ^iδ\|\nabla_{\theta_{i}}\tilde{J}_{i}^{\delta}-\nabla_{\theta_{i}}\hat{J}_{i}^{\delta}\|.

Let Ji,κδ(θ)=j𝒱¯lκt=0κγtrj(t)(sj(t),aj(t))J_{i,\kappa}^{\delta}(\theta)=\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}\sum_{t=0}^{\kappa}\gamma^{t}r_{j}(t)(s_{j}(t),a_{j}(t)), and J¯i,κδ(θ)=j𝒱¯lκt=κ+1γtrj(t)(sj(t),aj(t))\bar{J}^{\delta}_{i,\kappa}(\theta)=\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}\sum_{t=\kappa+1}^{\infty}\gamma^{t}r_{j}(t)(s_{j}(t),a_{j}(t)). Then

J~iδ(θ)J^iδ(θ)=Ji,κδ(θ)+J¯i,κδ(θ).\tilde{J}^{\delta}_{i}(\theta)-\hat{J}_{i}^{\delta}(\theta)=J_{i,\kappa}^{\delta}(\theta)+\bar{J}^{\delta}_{i,\kappa}(\theta). (63)

Notice that if D(𝒱l,j)>κD(\mathcal{V}_{l},j)>\kappa, then the reward of each agent j𝒱¯lκj\in\bar{\mathcal{V}}_{l}^{\kappa} is not affected by cluster 𝒱l\mathcal{V}_{l} at any time step tκt\leq\kappa. Therefore, θiJi,κδ(θ)=0\nabla_{\theta_{i}}J_{i,\kappa}^{\delta}(\theta)=0, which leads to (64),

θiJ~iδ(θ)θiJ^iδ(θ)=θiJ¯i,κδ(θ)=𝔼[j𝒱¯lκt=κ+1γtrjθ+δu(sj(t),aj(t))t=κ+1γtrjθ(sj(t),aj(t))δui]j𝒱¯lκ𝔼u𝒩(0,Id)[𝔼s𝒟t=κ+1γtrjθ+δu(sj(t),aj(t))rjθ(sj(t),aj(t))δui|s0=s]=γκ+1j𝒱¯lκ𝔼u𝒩(0,Id)[𝔼s𝒟[Vjπ(θ+δu)(sj,κ+1)Vjπ(θ)(sj,κ+1)δui|s0=s]]γκ+1j𝒱¯lκLj𝔼[uui]γκ+1j𝒱¯lκLj(𝔼[u2])1/2(𝔼[ui2])1/2γκ+1j𝒱¯lκLjddi,\begin{split}&\|\nabla_{\theta_{i}}\tilde{J}_{i}^{\delta}(\theta)-\nabla_{\theta_{i}}\hat{J}_{i}^{\delta}(\theta)\|=\|\nabla_{\theta_{i}}\bar{J}^{\delta}_{i,\kappa}(\theta)\|\\ &=\Bigg{\|}\mathbb{E}\left[\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}\frac{\sum_{t=\kappa+1}^{\infty}\gamma^{t}r_{j}^{\theta+\delta u}(s_{j}(t),a_{j}(t))-\sum_{t=\kappa+1}^{\infty}\gamma^{t}r_{j}^{\theta}(s_{j}(t),a_{j}(t))}{\delta}u_{i}\right]\Bigg{\|}\\ &\leq\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}\mathbb{E}_{u\sim\mathcal{N}(0,I_{d})}\left[\|\mathbb{E}_{s\sim\mathcal{D}}\sum_{t=\kappa+1}^{\infty}\gamma^{t}\frac{r_{j}^{\theta+\delta u}(s_{j}(t),a_{j}(t))-r_{j}^{\theta}(s_{j}(t),a_{j}(t))}{\delta}u_{i}\Big{|}s_{0}=s\|\right]\\ &=\gamma^{\kappa+1}\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}\mathbb{E}_{u\sim\mathcal{N}(0,I_{d})}\left[\|\mathbb{E}_{s\sim\mathcal{D}}\left[\frac{V^{\pi(\theta+\delta u)}_{j}(s_{j,\kappa+1})-V^{\pi(\theta)}_{j}(s_{j,\kappa+1})}{\delta}u_{i}\Big{|}s_{0}=s\right]\|\right]\\ &\leq\gamma^{\kappa+1}\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}L_{j}\mathbb{E}[\|u\|\|u_{i}\|]\\ &\leq\gamma^{\kappa+1}\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}L_{j}(\mathbb{E}[\|u\|^{2}])^{1/2}(\mathbb{E}[\|u_{i}\|^{2}])^{1/2}\\ &\leq\gamma^{\kappa+1}\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}L_{j}\sqrt{dd_{i}},\end{split} (64)

where the second equality used the two-point feedback zeroth-order oracle [36], the third equality used the definition of V_{i}^{\pi(\theta)}(s), the second inequality used Assumption 2, and the third inequality used Hölder's inequality. The proof is completed. \blacksquare

Proof of Theorem 2. Here we only present the parts that differ from the proof of Theorem 1.

𝔼[θiJδ(θk),gi(θk,uk,ξk)]=𝔼[θiJδ(θk)2]\displaystyle\mathbb{E}\left[\langle\nabla_{\theta_{i}}J^{\delta}(\theta^{k}),g_{i}(\theta^{k},u^{k},\xi^{k})\rangle\right]=\mathbb{E}[\|\nabla_{\theta_{i}}J^{\delta}(\theta^{k})\|^{2}]
+𝔼[θiJδ(θk),nlμikl(Tc)uik/δJ~i(θk+δuk)uik/δ]\displaystyle+\mathbb{E}[\langle\nabla_{\theta_{i}}J^{\delta}(\theta^{k}),n_{l}\mu_{i}^{kl}(T_{c})u^{k}_{i}/\delta-\tilde{J}_{i}(\theta^{k}+\delta u^{k})u^{k}_{i}/\delta\rangle]
+𝔼[θiJδ(θk),θiJ~i(θk)θiJ^i(θk)]\displaystyle+\mathbb{E}[\langle\nabla_{\theta_{i}}J^{\delta}(\theta^{k}),\nabla_{\theta_{i}}\tilde{J}_{i}(\theta^{k})-\nabla_{\theta_{i}}\hat{J}_{i}(\theta^{k})\rangle]
𝔼[θiJ(θk)2]12𝔼[θiJ(θk)2](Eiκ)2uik2/δ2diδ2Ai2\displaystyle\geq\mathbb{E}[\|\nabla_{\theta_{i}}J(\theta^{k})\|^{2}]-\frac{1}{2}\mathbb{E}\left[\|\nabla_{\theta_{i}}J(\theta^{k})\|^{2}\right]-(E_{i}^{\kappa})^{2}\|u_{i}^{k}\|^{2}/\delta^{2}-\frac{d_{i}}{\delta^{2}}A_{i}^{2}
=12𝔼[θiJ(θk)2](Eiκ)2diδ2diδ2Ai2,\displaystyle=\frac{1}{2}\mathbb{E}[\|\nabla_{\theta_{i}}J(\theta^{k})\|^{2}]-\frac{(E_{i}^{\kappa})^{2}d_{i}}{\delta^{2}}-\frac{d_{i}}{\delta^{2}}A_{i}^{2}, (65)

where Ai=γκ+1j𝒱¯lκLjddiA_{i}=\gamma^{\kappa+1}\sum_{j\in\bar{\mathcal{V}}_{l}^{\kappa}}L_{j}\sqrt{dd_{i}}, the first equality used 𝔼uik[J~i(θk+δuk)uik/δ]=θiJ~i(θk)\mathbb{E}_{u^{k}_{i}}[\tilde{J}_{i}(\theta^{k}+\delta u^{k})u^{k}_{i}/\delta]=\nabla_{\theta_{i}}\tilde{J}_{i}(\theta^{k}) and 𝔼uik[J^i(θk+δuk)uik/δ]=θiJ^i(θk)\mathbb{E}_{u^{k}_{i}}[\hat{J}_{i}(\theta^{k}+\delta u^{k})u^{k}_{i}/\delta]=\nabla_{\theta_{i}}\hat{J}_{i}(\theta^{k}), the inequality used ab14a2b2ab\geq-\frac{1}{4}a^{2}-b^{2}. Then we have

𝔼[θJδ(θk),g(θk,uk,ξk)]12𝔼[θJ(θk)2](E0κ)2dδ2dδ2A02,\mathbb{E}[\langle\nabla_{\theta}J^{\delta}(\theta^{k}),g(\theta^{k},u^{k},\xi^{k})\rangle]\geq\frac{1}{2}\mathbb{E}[\|\nabla_{\theta}J(\theta^{k})\|^{2}]-\frac{(E_{0}^{\kappa})^{2}d}{\delta^{2}}-\frac{d}{\delta^{2}}A_{0}^{2},

where E0κ=maxi𝒱EiκE_{0}^{\kappa}=\max_{i\in\mathcal{V}}E_{i}^{\kappa}, A0=maxi𝒱Ai=γκ+1maxl𝒞|𝒱¯lκ|L0dd0A_{0}=\max_{i\in\mathcal{V}}A_{i}=\gamma^{\kappa+1}\max_{l\in\mathcal{C}}|\bar{\mathcal{V}}_{l}^{\kappa}|L_{0}\sqrt{dd_{0}}.

It follows that

12η𝔼[θJδ(θk)2]η(E0κ)2d2δ2dδ2A02\displaystyle\frac{1}{2}\eta\mathbb{E}[\|\nabla_{\theta}J^{\delta}(\theta^{k})\|^{2}]-\eta\frac{(E_{0}^{\kappa})^{2}d}{2\delta^{2}}-\frac{d}{\delta^{2}}A_{0}^{2}
𝔼[Jδ(θk+1)Jδ(θk)]+dL2δη2𝔼[g(θk,uk,ξk)2]\displaystyle\leq\mathbb{E}[J^{\delta}(\theta^{k+1})-J^{\delta}(\theta^{k})]+\frac{\sqrt{d}L}{2\delta}\eta^{2}\mathbb{E}[\|g(\theta^{k},u^{k},\xi^{k})\|^{2}]
𝔼[Jδ(θk+1)Jδ(θk)]+LB0κd1.52δ3η2,\displaystyle\leq\mathbb{E}[J^{\delta}(\theta^{k+1})-J^{\delta}(\theta^{k})]+\frac{LB^{\kappa}_{0}d^{1.5}}{2\delta^{3}}\eta^{2}, (66)

where B0κ=(nlκ)2(σ02+J02)B_{0}^{\kappa}=(n_{l}^{\kappa})^{2}(\sigma_{0}^{2}+J_{0}^{2}).

Therefore, the following holds:

1Kk=0K1𝔼[θJδ(θk)2]\displaystyle\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}[\|\nabla_{\theta}J^{\delta}(\theta^{k})\|^{2}]
2η[1K(𝔼[Jδ(θK)]Jδ(θ0))+LB0κd1.52δ3η2]+2E02dδ2+2dδ2A02\displaystyle\leq\frac{2}{\eta}\left[\frac{1}{K}\left(\mathbb{E}[J^{\delta}(\theta^{K})]-J^{\delta}(\theta^{0})\right)+\frac{LB^{\kappa}_{0}d^{1.5}}{2\delta^{3}}\eta^{2}\right]+\frac{2E_{0}^{2}d}{\delta^{2}}+\frac{2d}{\delta^{2}}A_{0}^{2}
2η[1K(NJuJδ(θ0))+(nlκ)2(σ02+J02)Ld1.52δ3η2]+ϵ2+2dδ2A02\displaystyle\leq\frac{2}{\eta}\left[\frac{1}{K}(NJ_{u}-J^{\delta}(\theta^{0}))+(n_{l}^{\kappa})^{2}(\sigma_{0}^{2}+J_{0}^{2})\frac{Ld^{1.5}}{2\delta^{3}}\eta^{2}\right]+\frac{\epsilon}{2}+\frac{2d}{\delta^{2}}A_{0}^{2}
d1.5ϵ1.5K[2(NJuJδ(θ0))+(nlκ)2(σ02+J02)L4]+ϵ2+2dδ2A02.\displaystyle\leq\frac{d^{1.5}}{\epsilon^{1.5}\sqrt{K}}\left[2(NJ_{u}-J^{\delta}(\theta^{0}))+(n_{l}^{\kappa})^{2}(\sigma_{0}^{2}+J_{0}^{2})L^{4}\right]+\frac{\epsilon}{2}+\frac{2d}{\delta^{2}}A_{0}^{2}. (67)

The proof is completed. \blacksquare

X Appendix B: Properties of Cluster-Wise Graphs

In this appendix, we will analyze the relationships among graphs 𝒢X\mathcal{G}_{X}, X{S,O,R,L}X\in\{S,O,R,L\} from the cluster-wise perspective.

Inspired by the observations in Subsection III-B, the graph \mathcal{G}_{L} in Fig. 4 can be interpreted from a cluster perspective. (Note that the clustering in this paper is conducted only once, for graph \mathcal{G}_{SO}; the clusters discussed in other graphs still correspond to SCCs of \mathcal{G}_{SO}.) By regarding each cluster (corresponding to a maximal SCC in \mathcal{G}_{SO}) as a node, and adding a directed edge (l_{1},l_{2}) between any pair of nodes l_{1},l_{2}\in\mathcal{C} whenever there is at least one edge from cluster l_{1} to cluster l_{2} in \mathcal{G}_{L}, we define the cluster-wise graph of \mathcal{G}_{L} as the graph \mathcal{G}_{L}^{cl} in Fig. 8 (a). Similarly, we define cluster-wise graphs for \mathcal{G}_{SO} and \mathcal{G}_{SOR} as \mathcal{G}_{SO}^{cl} and \mathcal{G}_{SOR}^{cl}, respectively. In Example 1, since there are no edges between different clusters in \mathcal{G}_{R}, it holds that \mathcal{G}_{SO}^{cl}=\mathcal{G}_{SOR}^{cl}, as shown in Fig. 8 (b). Note that \mathcal{G}_{SO}\neq\mathcal{G}_{SOR} since \mathcal{E}_{R}\not\subseteq\mathcal{E}_{SO}.

Refer to caption
Figure 8: (a). The cluster-wise learning graph 𝒢Lcl\mathcal{G}_{L}^{cl}. (b). The cluster-wise graph 𝒢SOcl\mathcal{G}_{SO}^{cl} or 𝒢SORcl\mathcal{G}_{SOR}^{cl}.
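As a sketch of this cluster-wise construction (ours, assuming the networkx package is available; the toy graph is hypothetical), one can contract the SCCs of \mathcal{G}_{SO} and keep the inter-cluster edges of the node-wise graph of interest (e.g., \mathcal{G}_{L} or \mathcal{G}_{SOR}):

```python
import networkx as nx

def cluster_wise_graph(G_SO: nx.DiGraph, G: nx.DiGraph) -> nx.DiGraph:
    """Contract each SCC of G_SO to one node; keep an edge between two clusters
    whenever some member-to-member edge exists in the node-wise graph G."""
    sccs = list(nx.strongly_connected_components(G_SO))
    cluster_of = {v: idx for idx, scc in enumerate(sccs) for v in scc}
    G_cl = nx.DiGraph()
    G_cl.add_nodes_from(range(len(sccs)))
    for i, j in G.edges():
        li, lj = cluster_of[i], cluster_of[j]
        if li != lj:
            G_cl.add_edge(li, lj)
    return G_cl

# Hypothetical example: G_SO has SCCs {1, 2} and {3}; take G = G_SO itself.
G_SO = nx.DiGraph([(1, 2), (2, 1), (2, 3)])
print(cluster_wise_graph(G_SO, G_SO).edges())  # a single edge from cluster {1,2} to cluster {3}
```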

The cluster-wise graph \mathcal{G}^{cl} for a graph \mathcal{G} is constructed by regarding each cluster as one node and adding an edge between two nodes if there is an edge in \mathcal{G} between two agents belonging to the two corresponding clusters. Note that an SCC in \mathcal{G}_{SO} remains an SCC in \mathcal{G}_{L} and \mathcal{G}_{SOR}. Therefore, an edge from cluster l_{1} to cluster l_{2} in \mathcal{G}^{cl} always implies that any vertex j\in\mathcal{V}_{l_{2}} is reachable from any vertex i\in\mathcal{V}_{l_{1}} in the corresponding node-wise graph \mathcal{G}. Based on this fact, we give a specific result on the relationship between \mathcal{G}_{L}^{cl}, \mathcal{G}_{SO}^{cl} and \mathcal{G}_{SOR}^{cl} below.

Lemma 13

Given 𝒢S\mathcal{G}_{S}, 𝒢O\mathcal{G}_{O}, 𝒢R\mathcal{G}_{R} and the induced 𝒢L\mathcal{G}_{L}, the following statements are true:

(i) (SOcl)Lcl(SORcl)(\mathcal{E}_{SO}^{cl})^{\top}\subseteq\mathcal{E}_{L}^{cl}\subseteq(\mathcal{E}_{SOR}^{cl})^{\top};

(ii) (𝒢SOcl)=𝒢Lcl=(𝒢SORcl)(\mathcal{G}_{SO}^{cl})^{\top}=\mathcal{G}_{L}^{cl}=(\mathcal{G}_{SOR}^{cl})^{\top} if RSO\mathcal{E}_{R}\subseteq\mathcal{E}_{SO};

(iii) (𝒢SOcl)=𝒢Lcl=(𝒢SORcl)(\mathcal{G}_{SO}^{cl})^{\top}=\mathcal{G}_{L}^{cl}=(\mathcal{G}_{SOR}^{cl})^{\top} if and only if iSOji\stackrel{{\scriptstyle\mathcal{E}_{SO}}}{{\longrightarrow}}j for any (i,j)R(i,j)\in\mathcal{E}_{R}.

Proof. (i). Given (l1,l2)SOcl(l_{1},l_{2})\in\mathcal{E}_{SO}^{cl}, there must hold that iSOji\stackrel{{\scriptstyle\mathcal{E}_{SO}}}{{\longrightarrow}}j for any i𝒱l1i\in\mathcal{V}_{l_{1}} and j𝒱l2j\in\mathcal{V}_{l_{2}}. Due to the definition of 𝒢L\mathcal{G}_{L}, we have jiLj\in\mathcal{I}_{i}^{L} (i.e., (j,i)L(j,i)\in\mathcal{E}_{L}) for any i𝒱l1i\in\mathcal{V}_{l_{1}} and j𝒱l2j\in\mathcal{V}_{l_{2}}, implying that (l1,l2)(Lcl)(l_{1},l_{2})\in(\mathcal{E}_{L}^{cl})^{\top}. Therefore, SOcl(Lcl)\mathcal{E}_{SO}^{cl}\subseteq(\mathcal{E}_{L}^{cl})^{\top}. It follows that (SOcl)Lcl(\mathcal{E}_{SO}^{cl})^{\top}\subseteq\mathcal{E}_{L}^{cl}.

On the other hand, for any (l1,l2)Lcl(l_{1},l_{2})\in\mathcal{E}_{L}^{cl}, we have (i,j)L(i,j)\in\mathcal{E}_{L} for some i𝒱l1i\in\mathcal{V}_{l_{1}} and j𝒱l2j\in\mathcal{V}_{l_{2}}. According to Lemma 3, jSORij\stackrel{{\scriptstyle\mathcal{E}_{SOR}}}{{\longrightarrow}}i. Hence, (l1,l2)(SORcl)(l_{1},l_{2})\in(\mathcal{E}_{SOR}^{cl})^{\top}.

(ii). By the virtue of statement (i), it suffices to show that 𝒢SOcl=𝒢SORcl\mathcal{G}_{SO}^{cl}=\mathcal{G}_{SOR}^{cl} if RSO\mathcal{E}_{R}\subseteq\mathcal{E}_{SO}, which is true due to the definition of 𝒢SOR\mathcal{G}_{SOR}.

(iii). Sufficiency. The condition implies that the reachability between any two vertices in 𝒢SO\mathcal{G}_{SO} is the same as that in 𝒢SOR\mathcal{G}_{SOR}. Therefore, SOcl=SORcl\mathcal{E}_{SO}^{cl}=\mathcal{E}_{SOR}^{cl}.

Necessity. Suppose that there exists an edge (i,j)R(i,j)\in\mathcal{E}_{R} such that jj is not reachable from ii in 𝒢SO\mathcal{G}_{SO}. Then ii and jj must belong to two different clusters l1l_{1} and l2l_{2}, respectively. It follows that (l1,l2)SOR(l_{1},l_{2})\in\mathcal{E}_{SOR} and (l1,l2)SO(l_{1},l_{2})\notin\mathcal{E}_{SO}, which contradicts with 𝒢SOcl=𝒢SORcl\mathcal{G}_{SO}^{cl}=\mathcal{G}_{SOR}^{cl}. \blacksquare

In the existing literature of MARL, it is common to see the assumption that iR={i}\mathcal{I}_{i}^{R}=\{i\}. In this scenario, R=SO\mathcal{E}_{R}=\varnothing\subseteq\mathcal{E}_{SO}, therefore it always holds that 𝒢Lcl=(𝒢SOcl)=(𝒢SORcl)\mathcal{G}_{L}^{cl}=(\mathcal{G}_{SO}^{cl})^{\top}=(\mathcal{G}_{SOR}^{cl})^{\top}.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
  • [3] A. S. Polydoros and L. Nalpantidis, “Survey of model-based reinforcement learning: Applications on robotics,” Journal of Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, 2017.
  • [4] S. Mukherjee, A. Chakrabortty, H. Bai, A. Darvishi, and B. Fardanesh, “Scalable designs for reinforcement learning-based wide-area damping control,” IEEE Transactions on Smart Grid, vol. 12, no. 3, pp. 2389–2401, 2021.
  • [5] A. OroojlooyJadid and D. Hajinezhad, “A review of cooperative multi-agent deep reinforcement learning,” arXiv preprint arXiv:1908.03963, 2019.
  • [6] S. Gronauer and K. Diepold, “Multi-agent deep reinforcement learning: a survey,” Artificial Intelligence Review, pp. 1–49, 2021.
  • [7] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
  • [8] G. Qu, A. Wierman, and N. Li, “Scalable reinforcement learning of localized policies for multi-agent networked systems,” in Learning for Dynamics and Control.   PMLR, 2020, pp. 256–266.
  • [9] Y. Lin, G. Qu, L. Huang, and A. Wierman, “Multi-agent reinforcement learning in stochastic networked systems,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • [10] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in International Conference on Machine Learning.   PMLR, 2017, pp. 2681–2690.
  • [11] S. Nayak, K. Choi, W. Ding, S. Dolan, K. Gopalakrishnan, and H. Balakrishnan, “Scalable multi-agent reinforcement learning through intelligent information aggregation,” in International Conference on Machine Learning.   PMLR, 2023, pp. 25817–25833.
  • [12] Y. Zhang and M. M. Zavlanos, “Cooperative multi-agent reinforcement learning with partial observations,” IEEE Transactions on Automatic Control, 2023, doi: 10.1109/TAC.2023.3288025.
  • [13] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Basar, “Fully decentralized multi-agent reinforcement learning with networked agents,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5872–5881.
  • [14] C. Guestrin, M. Lagoudakis, and R. Parr, “Coordinated reinforcement learning,” in ICML, vol. 2.   Citeseer, 2002, pp. 227–234.
  • [15] J. R. Kok and N. Vlassis, “Collaborative multiagent reinforcement learning by payoff propagation,” Journal of Machine Learning Research, vol. 7, pp. 1789–1828, 2006.
  • [16] G. Jing, H. Bai, J. George, A. Chakrabortty, and P. K. Sharma, “Asynchronous distributed reinforcement learning for LQR control via zeroth-order block coordinate descent,” arXiv preprint arXiv:2107.12416, 2021.
  • [17] Y. Li, Y. Tang, R. Zhang, and N. Li, “Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach,” IEEE Transactions on Automatic Control, 2021.
  • [18] D. Görges, “Distributed adaptive linear quadratic control using distributed reinforcement learning,” IFAC-PapersOnLine, vol. 52, no. 11, pp. 218–223, 2019.
  • [19] G. Jing, H. Bai, J. George, and A. Chakrabortty, “Model-free optimal control of linear multi-agent systems via decomposition and hierarchical approximation,” IEEE Transactions on Control of Network Systems, 2021.
  • [20] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings.   Elsevier, 1994, pp. 157–163.
  • [21] S. Kar, J. M. Moura, and H. V. Poor, “$\mathcal{QD}$-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1848–1862, 2013.
  • [22] S. V. Macua, A. Tukiainen, D. G.-O. Hernández, D. Baldazo, E. M. de Cote, and S. Zazo, “Diff-dac: Distributed actor-critic for average multitask deep reinforcement learning,” in Adaptive Learning Agents (ALA) Conference, 2018.
  • [23] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup, “Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction,” in The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, 2011, pp. 761–768.
  • [24] A. Olshevsky, “Linear time average consensus on fixed graphs and implications for decentralized optimization and multi-agent control,” arXiv preprint arXiv:1411.4186, 2014.
  • [25] G. Jing, H. Bai, J. George, A. Chakrabortty, and P. K. Sharma, “Distributed cooperative multi-agent reinforcement learning with directed coordination graph,” in 2022 American Control Conference (ACC), to appear.   IEEE, 2022.
  • [26] G. Qu, Y. Lin, A. Wierman, and N. Li, “Scalable multi-agent reinforcement learning for networked systems with average reward,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [27] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls et al., “Value-decomposition networks for cooperative multi-agent learning,” arXiv preprint arXiv:1706.05296, 2017.
  • [28] T. Zhang, Y. Li, C. Wang, G. Xie, and Z. Lu, “Fop: Factorizing optimal joint policy of maximum-entropy multi-agent reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2021, pp. 12 491–12 500.
  • [29] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1467–1476.
  • [30] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright, “Derivative-free methods for policy optimization: Guarantees for linear quadratic systems,” Journal of Machine Learning Research, vol. 21, no. 21, pp. 1–51, 2020.
  • [31] D. Hajinezhad, M. Hong, and A. Garcia, “Zone: Zeroth-order nonconvex multiagent optimization over networks,” IEEE Transactions on Automatic Control, vol. 64, no. 10, pp. 3995–4010, 2019.
  • [32] C. Gratton, N. K. Venkategowda, R. Arablouei, and S. Werner, “Privacy-preserving distributed zeroth-order optimization,” arXiv preprint arXiv:2008.13468, 2020.
  • [33] Y. Tang, J. Zhang, and N. Li, “Distributed zero-order algorithms for nonconvex multi-agent optimization,” IEEE Transactions on Control of Network Systems, 2020.
  • [34] A. Akhavan, M. Pontil, and A. B. Tsybakov, “Distributed zero-order optimization under adversarial noise,” arXiv preprint arXiv:2102.01121, 2021.
  • [35] T. Chen, K. Zhang, G. B. Giannakis, and T. Basar, “Communication-efficient policy gradient methods for distributed reinforcement learning,” IEEE Transactions on Control of Network Systems, 2021.
  • [36] Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017.
  • [37] M. Pirotta, M. Restelli, and L. Bascetta, “Policy gradient in lipschitz markov decision processes,” Machine Learning, vol. 100, no. 2, pp. 255–283, 2015.
  • [38] H. Kumar, D. S. Kalogerias, G. J. Pappas, and A. Ribeiro, “Zeroth-order deterministic policy gradient,” arXiv preprint arXiv:2006.07314, 2020.
  • [39] A. Vemula, W. Sun, and J. Bagnell, “Contrasting exploration in parameter and action space: A zeroth-order optimization perspective,” in The 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 2926–2935.
  • [40] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
  • [41] J. Bhandari and D. Russo, “Global optimality guarantees for policy gradient methods,” arXiv preprint arXiv:1906.01786, 2019.
  • [42] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan, “On the theory of policy gradient methods: Optimality, approximation, and distribution shift.” Journal of Machine Learning Research, vol. 22, no. 98, pp. 1–76, 2021.
  • [43] H. Feng and J. Lavaei, “On the exponential number of connected components for the feasible set of optimal decentralized control problems,” in 2019 American Control Conference (ACC).   IEEE, 2019, pp. 1430–1437.
  • [44] Y. Zhang, Y. Zhou, K. Ji, and M. M. Zavlanos, “A new one-point residual-feedback oracle for black-box learning and control,” Automatica, p. 110006, 2021.
  • [45] L. Xiao, S. Boyd, and S. Lall, “A scheme for robust distributed sensor fusion based on average consensus,” in IPSN 2005. Fourth International Symposium on Information Processing in Sensor Networks, 2005.   IEEE, 2005, pp. 63–70.
Gangshan Jing received the Ph.D. degree in Control Theory and Control Engineering from Xidian University, Xi’an, China, in 2018. From 2016 to 2017, he was a research assistant at Hong Kong Polytechnic University. From 2018 to 2019, he was a postdoctoral researcher at Ohio State University. From 2019 to 2021, he was a postdoctoral researcher at North Carolina State University. Since December 2021, he has been an assistant professor in the School of Automation at Chongqing University. His research interests include control, optimization, and machine learning for network systems.
He Bai received his Ph.D. degree in Electrical Engineering from Rensselaer Polytechnic Institute, Troy, NY, in 2009. From 2009 to 2010, he was a postdoctoral researcher at Northwestern University, Evanston, IL. From 2010 to 2015, he was a Senior Research and Development Scientist at UtopiaCompression Corporation, Los Angeles, CA. In 2015, he joined the School of Mechanical and Aerospace Engineering at Oklahoma State University, Stillwater, OK, as an assistant professor. His research interests include distributed estimation, control and learning, reinforcement learning, nonlinear control, and robotics.
Jemin George received his M.S. (’07) and Ph.D. (’10) in Aerospace Engineering from the State University of New York at Buffalo. Prior to joining ARL in 2010, he worked at the U.S. Air Force Research Laboratory’s Space Vehicles Directorate and the National Aeronautics and Space Administration’s Langley Aerospace Research Center. From 2014 to 2017, he was a Visiting Scholar at Northwestern University, Evanston, IL. His principal research interests include decentralized/distributed learning, stochastic systems, control theory, nonlinear estimation/filtering, networked sensing, and information fusion.
Aranya Chakrabortty received the Ph.D. degree in Electrical Engineering from Rensselaer Polytechnic Institute, NY, in 2008. From 2008 to 2009, he was a postdoctoral research associate at the University of Washington, Seattle, WA. From 2009 to 2010, he was an assistant professor at Texas Tech University, Lubbock, TX. In 2010, he joined the Electrical and Computer Engineering Department at North Carolina State University, Raleigh, NC, where he is currently a Professor. His research interests are in all branches of control theory with applications to electric power systems. He received the NSF CAREER award in 2011.
Piyush Sharma received his M.S. and Ph.D. degrees in Applied Mathematics from the University of Puerto Rico and Delaware State University, respectively. He has government and industry work experience. Currently, he is with the U.S. Army as an AI Coordinator at ATEC; earlier, he was a Computer Scientist at DEVCOM ARL. Prior to joining ARL, he worked in Infosys’ Data Analytics Unit (DNA) as a Senior Associate Data Scientist, responsible for thought leadership and solving stakeholders’ problems.