
DFAC Framework: Factorizing the Value Function via
Quantile Mixture for Multi-Agent Distributional Q-Learning

Wei-Fang Sun    Cheng-Kuang Lee    Chun-Yi Lee
Abstract

In fully cooperative multi-agent reinforcement learning (MARL) settings, the environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of the other agents. To address the above issues, we integrate distributional RL and value function factorization methods by proposing a Distributional Value Function Factorization (DFAC) framework to generalize expected value function factorization methods to their DFAC variants. DFAC extends the individual utility functions from deterministic variables to random variables, and models the quantile function of the total return as a quantile mixture. To validate DFAC, we demonstrate DFAC’s ability to factorize a simple two-step matrix game with stochastic rewards and perform experiments on all Super Hard tasks of StarCraft Multi-Agent Challenge, showing that DFAC is able to outperform expected value function factorization baselines.

Reinforcement Learning, Multi-Agent RL, Distributional RL

1 Introduction

In multi-agent reinforcement learning (MARL), one of the popular research directions is to enhance the training procedure of fully cooperative and decentralized agents. Examples of such agents include a fleet of unmanned aerial vehicles (UAVs), a group of autonomous cars, etc. This research direction aims to develop a decentralized and cooperative behavior policy for each agent, and is especially difficult for MARL settings without an explicit communication channel. The most straightforward approach is independent Q-learning (IQL) (Tan, 1993), where each agent is trained independently, with each behavior policy trained to optimize the overall reward of an episode. Nevertheless, each agent’s policy may not converge owing to two main difficulties: (1) non-stationary environments caused by the changing behaviors of the agents, and (2) spurious reward signals originating from the actions of the other agents. The agents’ partial observability of the environment further exacerbates the above issues. Therefore, in the past few years, a number of MARL researchers have turned their attention to centralized training with decentralized execution (CTDE) approaches, with the objective of stabilizing the training procedure while maintaining the agents’ abilities for decentralized execution (Oliehoek et al., 2016). Among these CTDE approaches, value function factorization methods (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019) are especially promising in terms of their superior performance and data efficiency (Samvelyan et al., 2019).

Value function factorization methods introduce the assumption of individual-global-max (IGM) (Son et al., 2019), which assumes that each agent’s optimal actions result in the optimal joint actions of the entire group. Based on IGM, the total return of a group of agents can be factorized into separate utility functions (Guestrin et al., 2001) (or simply ‘utility’ hereafter) for each agent. The utilities allow the agents to independently derive their own optimal actions during execution, and deliver promising performance in the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019). Unfortunately, current value function factorization methods only concentrate on estimating the expectations of the utilities, overlooking the additional information contained in the full return distributions. Such information, nevertheless, has been demonstrated to be beneficial for policy learning in the recent literature (Lyle et al., 2019).

In the past few years, distributional RL has been empirically shown to enhance value function estimation in various single-agent RL (SARL) domains (Bellemare et al., 2017; Dabney et al., 2018b, a; Rowland et al., 2019; Yang et al., 2019). Instead of estimating a single scalar Q-value, it approximates the probability distribution of the return by either a categorical distribution (Bellemare et al., 2017) or a quantile function (Dabney et al., 2018b, a). Even though the above methods may be beneficial to the MARL domain due to their ability to capture uncertainty, they are inherently incompatible with expected value function factorization methods (e.g., value decomposition network (VDN) (Sunehag et al., 2018) and QMIX (Rashid et al., 2018)). The incompatibility arises from two aspects: (1) maintaining IGM in a distributional form, and (2) factorizing the probability distribution of the total return into individual utilities. As a result, an effective and efficient approach that resolves this incompatibility is crucial for bridging the gap between value function factorization methods and distributional RL.

In this paper, we propose a Distributional Value Function Factorization (DFAC) framework, to efficiently integrate value function factorization methods with distributional RL. DFAC solves the incompatibility by two techniques: (1) Mean-Shape Decomposition and (2) Quantile Mixture. The former allows the generalization of expected value function factorization methods (e.g., VDN and QMIX) to their DFAC variants without violating IGM. The latter allows the total return distribution to be factorized into individual utility distributions in a computationally efficient manner. To validate the effectiveness of DFAC, we first demonstrate the ability of distribution factorization on a two-step matrix game with stochastic rewards. Then, we perform experiments on all Super Hard maps in SMAC. The experimental results show that DFAC offers beneficial impacts on the baseline methods in all Super Hard maps. In summary, the primary contribution is the introduction of DFAC for bridging the gap between distributional RL and value function factorization methods efficiently by mean-shape decomposition and quantile mixture.

2 Background and Related Works

In this section, we introduce the essential background material for understanding the contents of this paper. We first define the problem formulation of cooperative MARL and CTDE. Next, we describe the conventional formulation of IGM and the value function factorization methods. Then, we walk through the concepts of distributional RL, quantile function, as well as quantile regression, which are the fundamental concepts frequently mentioned in this paper. After that, we explain the implicit quantile network, a key approach adopted in this paper for approximating quantiles. Finally, we introduce the concept of quantile mixture, which is leveraged by DFAC for factorizing the return distribution.

2.1 Cooperative MARL and CTDE

In this work, we consider a fully cooperative MARL environment modeled as a decentralized and partially observable Markov Decision Process (Dec-POMDP) (Oliehoek & Amato, 2016) with stochastic rewards, which is described as a tuple $\langle\mathbb{S},\mathbb{K},\mathbb{O}_{\mathrm{jt}},\mathbb{U}_{\mathrm{jt}},P,O,R,\gamma\rangle$ and is defined as follows:

  • $\mathbb{S}$ is the finite set of global states in the environment, where $s'\in\mathbb{S}$ denotes the next state of the current state $s\in\mathbb{S}$. The state information is optionally available during training, but not available to the agents during execution.

  • $\mathbb{K}=\{1,...,\mathrm{K}\}$ is the set of $\mathrm{K}$ agents. We use $k\in\mathbb{K}$ to denote the index of the agent.

  • $\mathbb{O}_{\mathrm{jt}}=\Pi_{k\in\mathbb{K}}\mathbb{O}_{k}$ is the set of joint observations. At each timestep, a joint observation $\mathbf{o}=\langle o_{1},...,o_{\mathrm{K}}\rangle\in\mathbb{O}_{\mathrm{jt}}$ is received. Each agent $k$ is only able to observe its individual observation $o_{k}\in\mathbb{O}_{k}$.

  • $\mathbb{H}_{\mathrm{jt}}=\Pi_{k\in\mathbb{K}}\mathbb{H}_{k}$ is the set of joint action-observation histories. The joint history $\mathbf{h}=\langle h_{1},...,h_{\mathrm{K}}\rangle\in\mathbb{H}_{\mathrm{jt}}$ concatenates all received observations and performed actions before a certain timestep, where $h_{k}\in\mathbb{H}_{k}$ represents the action-observation history of agent $k$.

  • $\mathbb{U}_{\mathrm{jt}}=\Pi_{k\in\mathbb{K}}\mathbb{U}_{k}$ is the set of joint actions. At each timestep, the entire group of agents takes a joint action $\mathbf{u}=\langle u_{1},...,u_{\mathrm{K}}\rangle\in\mathbb{U}_{\mathrm{jt}}$. The individual action $u_{k}\in\mathbb{U}_{k}$ of each agent $k$ is determined by its stochastic policy $\pi_{k}(u_{k}|h_{k}):\mathbb{H}_{k}\times\mathbb{U}_{k}\rightarrow[0,1]$, expressed as $u_{k}\sim\pi_{k}(\cdot|h_{k})$. Similarly, in single-agent scenarios, we use $u$ and $u'$ to denote the actions of the agent at states $s$ and $s'$ under policy $\pi$, respectively.

  • $\mathbb{T}=\{1,...,\mathrm{T}\}$ represents the set of timesteps with horizon $\mathrm{T}$, where the index of the current timestep is denoted as $t\in\mathbb{T}$. $s^{t}$, $o^{t}$, $h^{t}$, and $u^{t}$ correspond to the environment information at timestep $t$.

  • The transition function $P(s'|s,\mathbf{u}):\mathbb{S}\times\mathbb{U}_{\mathrm{jt}}\times\mathbb{S}\rightarrow[0,1]$ specifies the state transition probabilities. Given $s$ and $\mathbf{u}$, the next state is denoted as $s'\sim P(\cdot|s,\mathbf{u})$.

  • The observation function $O(\mathbf{o}|s):\mathbb{O}_{\mathrm{jt}}\times\mathbb{S}\rightarrow[0,1]$ specifies the joint observation probabilities. Given $s$, the joint observation is represented as $\mathbf{o}\sim O(\cdot|s)$.

  • $R(r|s,\mathbf{u}):\mathbb{S}\times\mathbb{U}_{\mathrm{jt}}\times\mathbb{R}\rightarrow[0,1]$ is the joint reward function shared among all agents. Given $s$ and $\mathbf{u}$, the team reward is expressed as $r\sim R(\cdot|s,\mathbf{u})$.

  • $\gamma\in\mathbb{R}$ is the discount factor with value within $(0,1]$.

Under such an MARL formulation, this work concentrates on CTDE value function factorization methods, where the agents are trained in a centralized fashion and executed in a decentralized manner. In other words, the joint observation history $\mathbf{h}$ is available during the learning process of the individual policies $[\pi_{k}]_{k\in\mathbb{K}}$. During execution, each agent’s policy $\pi_{k}$ conditions only on its own observation history $h_{k}$.

2.2 IGM and Factorizable Tasks

IGM is necessary for value function factorization (Son et al., 2019). For a joint action-value function $Q_{\mathrm{jt}}(\mathbf{h},\mathbf{u}):\mathbb{H}_{\mathrm{jt}}\times\mathbb{U}_{\mathrm{jt}}\rightarrow\mathbb{R}$, if there exist $\mathrm{K}$ individual utility functions $[Q_{k}(h_{k},u_{k}):\mathbb{H}_{k}\times\mathbb{U}_{k}\rightarrow\mathbb{R}]_{k\in\mathbb{K}}$ such that the following condition holds:

\arg\max_{\mathbf{u}}Q_{\mathrm{jt}}(\mathbf{h},\mathbf{u})=\begin{pmatrix}\arg\max_{u_{1}}Q_{1}(h_{1},u_{1})\\ \vdots\\ \arg\max_{u_{\mathrm{K}}}Q_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})\end{pmatrix},  (1)

then $[Q_{k}]_{k\in\mathbb{K}}$ are said to satisfy IGM for $Q_{\mathrm{jt}}$ under $\mathbf{h}$. In this case, we also say that $Q_{\mathrm{jt}}(\mathbf{h},\mathbf{u})$ is factorized by $[Q_{k}(h_{k},u_{k})]_{k\in\mathbb{K}}$ (Son et al., 2019). If $Q_{\mathrm{jt}}$ in a given task is factorizable under all $\mathbf{h}\in\mathbb{H}_{\mathrm{jt}}$, we say that the task is factorizable. Intuitively, a task being factorizable indicates that there exists a factorization such that each agent can select its greedy action according to its individual utility $Q_{k}$ independently in a decentralized fashion. This enables the optimal individual actions to implicitly achieve the optimal joint action across the $\mathrm{K}$ agents. Since there is no individual reward, the factorized utilities do not estimate expected returns on their own (Guestrin et al., 2001) and therefore differ from the value function definition commonly used in SARL.
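To make the IGM condition concrete, the short Python sketch below (ours, with illustrative hand-generated utility values rather than learned ones) checks that, under an additive factorization such as the one used by VDN in Section 2.3, the decentralized per-agent greedy actions coincide with the centralized greedy joint action of Eq. (1):

```python
import numpy as np

# A toy check of the IGM condition under an additive factorization: the
# decentralized per-agent greedy actions coincide with the centralized greedy
# joint action. The utility values below are illustrative, not learned.
rng = np.random.default_rng(0)
K, num_actions = 3, 4
utilities = rng.normal(size=(K, num_actions))       # Q_k(h_k, u_k) for a fixed history h

# Decentralized greedy actions: each agent maximizes its own utility.
decentralized = tuple(np.argmax(utilities, axis=1))

# Centralized greedy joint action over Q_jt(h, u) = sum_k Q_k(h_k, u_k).
joint_values = np.zeros((num_actions,) * K)
for joint_action in np.ndindex(*joint_values.shape):
    joint_values[joint_action] = sum(utilities[k, a] for k, a in enumerate(joint_action))
centralized = np.unravel_index(np.argmax(joint_values), joint_values.shape)

assert decentralized == tuple(centralized)           # Eq. (1) holds for additive mixing
```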

2.3 Value Function Factorization Methods

Based on IGM, value function factorization methods enable centralized training for factorizable tasks, while maintaining the ability for decentralized execution. In this work, we consider two such methods, VDN and QMIX, which can solve the subsets of factorizable tasks that satisfy Additivity (Eq. (2)) and Monotonicity (Eq. (3)), respectively, given by:

Q_{\mathrm{jt}}(\mathbf{h},\mathbf{u})=\sum^{\mathrm{K}}_{k=1}Q_{k}(h_{k},u_{k}),  (2)
Q_{\mathrm{jt}}(\mathbf{h},\mathbf{u})=M(Q_{1}(h_{1},u_{1}),...,Q_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})\,|\,s),  (3)

where $M$ is a monotonic function that satisfies $\frac{\partial M}{\partial Q_{k}}\geq 0,\forall k\in\mathbb{K}$, and conditions on the state $s$ if this information is available during training. Either of these two equations is a sufficient condition for IGM (Son et al., 2019).
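For illustration, the following PyTorch sketch implements the two mixing rules above. It is not the exact architecture used in this work; the embedding size, the ELU activation, and the hypernetwork layout are our assumptions based on common QMIX implementations. Monotonicity is enforced by taking the absolute value of the state-conditioned mixing weights:

```python
import torch
import torch.nn as nn

class AdditiveMixer(nn.Module):
    """VDN-style mixing (Eq. (2)): Q_jt is the sum of the per-agent utilities."""
    def forward(self, q_k, state=None):                    # q_k: (batch, K)
        return q_k.sum(dim=1, keepdim=True)                # Q_jt: (batch, 1)

class MonotonicMixer(nn.Module):
    """QMIX-style mixing (Eq. (3)): state-conditioned weights are passed through
    abs(), so that dM/dQ_k >= 0 holds for every agent k by construction."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, q_k, state):                         # q_k: (batch, K), state: (batch, state_dim)
        b = q_k.size(0)
        w1 = self.hyper_w1(state).abs().view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.nn.functional.elu(torch.bmm(q_k.unsqueeze(1), w1) + b1)
        w2 = self.hyper_w2(state).abs().view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)     # Q_jt: (batch, 1)
```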

2.4 Distributional RL

For notational simplicity, we consider a degenerate case with only a single agent and a fully observable environment until the end of Section 2.6. Distributional RL generalizes classic expected RL methods by capturing the full return distribution $Z(s,u)$ instead of the expected return $Q(s,u)$, and outperforms expected RL methods in various single-agent RL domains (Bellemare et al., 2017, 2019; Dabney et al., 2018b, a; Rowland et al., 2019; Yang et al., 2019). Moreover, distributional RL enables improvements (Nikolov et al., 2019; Zhang & Yao, 2019; Mavrin et al., 2019) that require the information of the full return distribution. We define the distributional Bellman operator $\mathcal{T}^{\pi}$ as follows:

\mathcal{T}^{\pi}Z(s,u)\stackrel{D}{:=}R(s,u)+\gamma Z(s',u'),  (4)

and the distributional Bellman optimality operator $\mathcal{T}^{*}$ as:

\mathcal{T}^{*}Z(s,u)\stackrel{D}{:=}R(s,u)+\gamma Z(s',u'^{*}),  (5)

where $u'^{*}=\arg\max_{u'}\mathbb{E}[Z(s',u')]$ is the optimal action at state $s'$, and the expression $X\stackrel{D}{=}Y$ denotes that the random variables $X$ and $Y$ follow the same distribution. Given some initial distribution $Z_{0}$, repeatedly applying $\mathcal{T}^{\pi}$ makes $Z$ converge to the return distribution $Z^{\pi}$ under $\pi$, since $\mathcal{T}^{\pi}$ is a contraction in terms of the $p$-Wasserstein distance for all $p\in[1,\infty)$; in contrast, repeatedly applying $\mathcal{T}^{*}$ makes $Z$ alternate among the optimal return distributions in the set $\mathcal{Z}^{*}:=\{Z^{\pi^{*}}:\pi^{*}\in\Pi^{*}\}$, where $\Pi^{*}$ denotes the set of optimal policies (Bellemare et al., 2017). The $p$-Wasserstein distance $W_{p}$ between the probability distributions of random variables $X$ and $Y$ is given by:

W_{p}(X,Y)=\left(\int_{0}^{1}|F^{-1}_{X}(\omega)-F^{-1}_{Y}(\omega)|^{p}\,\mathrm{d}\omega\right)^{1/p},  (6)

where $F^{-1}_{X}$ and $F^{-1}_{Y}$ are the quantile functions of $X$ and $Y$, respectively.
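As a minimal illustration (assuming SciPy is available; this is not the paper’s code), the $p$-Wasserstein distance of Eq. (6) can be approximated by Monte-Carlo integration over the two quantile functions:

```python
import numpy as np
from scipy.stats import norm

def wasserstein_p(quantile_x, quantile_y, p=1, num_samples=100_000):
    """Monte-Carlo approximation of Eq. (6): integrate the absolute difference
    of the two quantile functions over omega ~ U([0, 1])."""
    omega = np.random.uniform(1e-9, 1.0, size=num_samples)
    return (np.abs(quantile_x(omega) - quantile_y(omega)) ** p).mean() ** (1.0 / p)

# Example: the 1-Wasserstein distance between N(0, 1) and N(2, 1) is 2,
# since the two distributions differ only by a shift of the mean.
print(wasserstein_p(norm(0, 1).ppf, norm(2, 1).ppf, p=1))   # ~2.0
```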

2.5 Quantile Function and Quantile Regression

The relationship between the cumulative distribution function (CDF) $F_{X}$ and the quantile function $F^{-1}_{X}$ (the generalized inverse CDF) of a random variable $X$ is formulated as:

F^{-1}_{X}(\omega)=\inf\{x\in\mathbb{R}:\omega\leq F_{X}(x)\},\ \forall\omega\in[0,1].  (7)

The expectation of $X$ expressed in terms of $F^{-1}_{X}(\omega)$ is:

\mathbb{E}[X]=\int_{0}^{1}F^{-1}_{X}(\omega)\,\mathrm{d}\omega.  (8)

In (Dabney et al., 2018b), the authors model the value function as a quantile function $F^{-1}(s,u|\omega)$. During optimization, a pair-wise sampled temporal difference (TD) error $\delta$ for two quantile samples $\omega,\omega'\sim U([0,1])$ is defined as:

\delta_{t}^{\omega,\omega'}=r+\gamma F^{-1}(s',u'|\omega')-F^{-1}(s,u|\omega).  (9)

The pair-wise loss $\rho^{\kappa}_{\omega}$ is then defined based on the Huber quantile regression loss $\mathcal{L}_{\kappa}$ (Dabney et al., 2018b) with threshold $\kappa=1$, and is formulated as follows:

\rho^{\kappa}_{\omega}(\delta^{\omega,\omega'})=|\omega-\mathbb{I}\{\delta^{\omega,\omega'}<0\}|\frac{\mathcal{L}_{\kappa}(\delta^{\omega,\omega'})}{\kappa}\text{, with}  (10)
\mathcal{L}_{\kappa}(\delta^{\omega,\omega'})=\begin{cases}\frac{1}{2}(\delta^{\omega,\omega'})^{2},&\text{if }|\delta^{\omega,\omega'}|\leq\kappa\\ \kappa(|\delta^{\omega,\omega'}|-\frac{1}{2}\kappa),&\text{otherwise}\end{cases}.  (11)

Given $\mathrm{N}$ quantile samples $[\omega_{i}]^{\mathrm{N}}_{i=1}$ to be optimized with regard to $\mathrm{N}'$ target quantile samples $[\omega'_{j}]^{\mathrm{N}'}_{j=1}$, the total loss $\mathcal{L}(s,u,r,s')$ is defined as the sum of the pair-wise losses, and is expressed as the following:

\mathcal{L}(s,u,r,s')=\frac{1}{\mathrm{N}'}\sum^{\mathrm{N}}_{i=1}\sum^{\mathrm{N}'}_{j=1}\rho^{\kappa}_{\omega_{i}}(\delta^{\omega_{i},\omega'_{j}}).  (12)
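A minimal PyTorch sketch of Eqs. (9)-(12) is given below; the tensor layout (batches of $\mathrm{N}$ predicted and $\mathrm{N}'$ target quantile values) is our assumption, not a prescribed interface:

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, omegas, kappa=1.0):
    """Sketch of Eqs. (9)-(12): pair-wise quantile Huber regression loss.
      pred_quantiles:   (batch, N)   F^{-1}(s, u | omega_i)
      target_quantiles: (batch, N')  r + gamma * F^{-1}(s', u' | omega'_j)
      omegas:           (batch, N)   the sampled quantile levels omega_i"""
    # Pair-wise TD errors delta^{omega_i, omega'_j} (Eq. (9)), shape (batch, N, N').
    delta = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)
    abs_delta = delta.abs()
    # Huber loss L_kappa (Eq. (11)).
    huber = torch.where(abs_delta <= kappa,
                        0.5 * delta.pow(2),
                        kappa * (abs_delta - 0.5 * kappa))
    # Asymmetric quantile weighting |omega_i - I{delta < 0}| (Eq. (10)).
    weight = (omegas.unsqueeze(2) - (delta.detach() < 0).float()).abs()
    # Sum over the N predicted quantiles, average over the N' targets (Eq. (12)),
    # then average over the batch.
    return (weight * huber / kappa).sum(dim=1).mean(dim=1).mean()
```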

2.6 Implicit Quantile Network

Implicit quantile network (IQN) (Dabney et al., 2018a) is relatively light-weight when compared to other distributional RL methods. It approximates the return distribution $Z(s,u)$ by an implicit quantile function $F^{-1}(s,u|\omega)=g(\psi(s),\phi(\omega))_{u}$ for some differentiable functions $g$, $\psi$, and $\phi$. Such an architecture is a type of universal value function approximator (UVFA) (Schaul et al., 2015), which generalizes its predictions across states $s\in\mathbb{S}$ and goals $\omega\in[0,1]$, with the goals defined as different quantiles of the return distribution. In practice, $\phi$ first expands the scalar $\omega$ to an $\mathrm{n}$-dimensional vector by $[\cos(\pi i\omega)]^{\mathrm{n}-1}_{i=0}$, followed by a single hidden layer with weights $[w_{ij}]$ and biases $[b_{j}]$ to produce a quantile embedding $\phi(\omega)=[\phi(\omega)_{j}]^{\mathrm{dim}(\phi(\omega))-1}_{j=0}$. The expression of $\phi(\omega)_{j}$ can be represented as the following:

\phi(\omega)_{j}:=\mathrm{ReLU}\left(\sum_{i=0}^{\mathrm{n}-1}\cos(\pi i\omega)w_{ij}+b_{j}\right),  (13)

where $\mathrm{n}=64$ and $\mathrm{dim}(\phi(\omega))=\mathrm{dim}(\psi(s))$. Then, $\phi(\omega)$ is combined with the state embedding $\psi(s)$ by the element-wise (Hadamard) product ($\odot$), expressed as $g:=\psi\odot\phi$. The loss of IQN is defined as Eq. (12) by sampling a batch of $\mathrm{N}$ and $\mathrm{N}'$ quantiles from the policy network and the target network, respectively. During execution, the action with the largest expected return $Q(s,u)$ is chosen. Since IQN does not model the expected return explicitly, $Q(s,u)$ is approximated by calculating the mean of the sampled returns through $\hat{\mathrm{N}}$ quantile samples $\hat{\omega}_{i}\sim U([0,1]),\forall i\in[1,\hat{\mathrm{N}}]$ based on Eq. (8), expressed as follows:

Q(s,u)=\int_{0}^{1}F^{-1}(s,u|\omega)\,\mathrm{d}\omega\approx\frac{1}{\hat{\mathrm{N}}}\sum_{i=1}^{\hat{\mathrm{N}}}F^{-1}(s,u|\hat{\omega}_{i}).  (14)
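The following PyTorch sketch illustrates the quantile embedding of Eq. (13) and the expectation approximation of Eq. (14); the module name and the hypothetical `psi_s` and `f_inv` components in the usage comment are ours, not part of the original IQN implementation:

```python
import math
import torch
import torch.nn as nn

class QuantileEmbedding(nn.Module):
    """Sketch of Eq. (13): a cosine expansion of the quantile omega followed by a
    single linear layer with ReLU. n = 64 as in the text; embed_dim should match
    dim(psi(s)) so that the Hadamard product g = psi ⊙ phi is well defined."""
    def __init__(self, embed_dim, n=64):
        super().__init__()
        self.register_buffer("i_range", torch.arange(n, dtype=torch.float32))  # i = 0, ..., n-1
        self.linear = nn.Linear(n, embed_dim)

    def forward(self, omega):                                         # omega: (batch, Nq)
        cos = torch.cos(math.pi * self.i_range * omega.unsqueeze(-1))  # (batch, Nq, n)
        return torch.relu(self.linear(cos))                            # phi(omega): (batch, Nq, embed_dim)

# Approximating Q(s, u) as in Eq. (14): average the implicit quantile function over
# N_hat uniformly sampled quantiles. `psi_s` (the state embedding) and `f_inv` (the
# final layers mapping g to per-action quantile values) are hypothetical modules.
#   omega_hat = torch.rand(batch_size, N_hat)
#   g = psi_s.unsqueeze(1) * QuantileEmbedding(embed_dim)(omega_hat)   # psi ⊙ phi
#   q_values = f_inv(g).mean(dim=1)                                    # (batch, num_actions)
```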

2.7 Quantile Mixture

Multiple quantile functions (e.g., IQNs) sharing the same quantile $\omega$ may be combined into a single quantile function $F^{-1}(\omega)$, in the form of a quantile mixture expressed as follows:

F^{-1}(\omega)=\sum^{\mathrm{K}}_{k=1}\beta_{k}F^{-1}_{k}(\omega),  (15)

where $[F^{-1}_{k}(\omega)]_{k\in\mathbb{K}}$ are quantile functions, and $[\beta_{k}]_{k\in\mathbb{K}}$ are model parameters (Karvanen, 2006). The condition on $[\beta_{k}]_{k\in\mathbb{K}}$ is that $F^{-1}(\omega)$ must satisfy the properties of a quantile function. The concept of a quantile mixture is analogous to the mixture of multiple probability density functions (PDFs), expressed as follows:

f(x)=\sum^{\mathrm{K}}_{k=1}\alpha_{k}f_{k}(x),  (16)

where $[f_{k}(x)]_{k\in\mathbb{K}}$ are PDFs, $\sum^{\mathrm{K}}_{k=1}\alpha_{k}=1$, and $\alpha_{k}\geq 0$.
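A small numerical sketch of Eq. (15) is given below (assuming SciPy for the component quantile functions; the weights are illustrative). It also shows that sampling from a quantile mixture amounts to inverse-transform sampling with a shared $\omega$, and that the mixture’s mean is the $\beta$-weighted sum of the component means:

```python
import numpy as np
from scipy.stats import expon, norm

# A sketch of Eq. (15): a quantile mixture combines component quantile functions
# with non-negative weights beta_k at the SAME quantile level omega, in contrast
# to the PDF mixture of Eq. (16), which mixes densities.
components = [norm(loc=0.0, scale=1.0).ppf, expon(scale=2.0).ppf]
betas = [0.7, 0.3]                          # assumed non-negative weights

def mixture_quantile(omega):
    return sum(b * f(omega) for b, f in zip(betas, components))

# Inverse-transform sampling from the mixture: a weighted sum of comonotonic
# component samples, all driven by the same omega.
omega = np.random.uniform(1e-9, 1.0, size=100_000)
samples = mixture_quantile(omega)
print(samples.mean())   # ~ 0.7 * E[N(0,1)] + 0.3 * E[Exp(2)] = 0.6
```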

3 Methodology

In this section, we walk through the proposed DFAC framework and its derivation procedure. We first discuss a naive distributional factorization and its limitation in Section 3.1. Then, we introduce the DFAC framework to address the limitation, and show that DFAC is able to generalize distributional RL to all factorizable tasks in Section 3.2. After that, a practical implementation of DFAC based on quantile mixture is presented in Section 3.3. Finally, DDN and DMIX are introduced as the DFAC variants of VDN and QMIX, respectively, in Section 3.4. All proofs of the theorems in this section are provided in the supplementary material.

3.1 Distributional IGM

Since IGM is necessary for value function factorization, a distributional factorization that satisfies IGM is required for factorizing stochastic value functions. We first discuss a naive distributional factorization that simply replaces deterministic utilities QQ with stochastic utilities ZZ. Then, we provide a theorem to show that the naive distributional factorization is insufficient to guarantee the IGM condition.

Definition 1 (Distributional IGM).

A finite number of individual stochastic utilities $[Z_{k}(h_{k},u_{k})]_{k\in\mathbb{K}}$ are said to satisfy Distributional IGM (DIGM) for a stochastic joint action-value function $Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})$ under $\mathbf{h}$, if $[\mathbb{E}[Z_{k}(h_{k},u_{k})]]_{k\in\mathbb{K}}$ satisfy IGM for $\mathbb{E}[Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})]$ under $\mathbf{h}$, represented as:

\arg\max_{\mathbf{u}}\mathbb{E}[Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})]=\begin{pmatrix}\arg\max_{u_{1}}\mathbb{E}[Z_{1}(h_{1},u_{1})]\\ \vdots\\ \arg\max_{u_{\mathrm{K}}}\mathbb{E}[Z_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})]\end{pmatrix}.
Theorem 1.

Given a deterministic joint action-value function $Q_{\mathrm{jt}}$, a stochastic joint action-value function $Z_{\mathrm{jt}}$, and a factorization function $\Psi$ for deterministic utilities:

Q_{\mathrm{jt}}(\mathbf{h},\mathbf{u})=\Psi(Q_{1}(h_{1},u_{1}),...,Q_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})\,|\,s),

such that $[Q_{k}]_{k\in\mathbb{K}}$ satisfy IGM for $Q_{\mathrm{jt}}$ under $\mathbf{h}$, the following distributional factorization:

Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})=\Psi(Z_{1}(h_{1},u_{1}),...,Z_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})\,|\,s)

is insufficient to guarantee that $[Z_{k}]_{k\in\mathbb{K}}$ satisfy DIGM for $Z_{\mathrm{jt}}$ under $\mathbf{h}$.

In order to satisfy DIGM for stochastic utilities, an alternative factorization strategy is necessary.

3.2 The Proposed DFAC Framework

We propose Mean-Shape Decomposition and the DFAC framework to ensure that DIGM is satisfied for stochastic utilities.

Definition 2 (Mean-Shape Decomposition).

A given random variable $Z$ can be decomposed as follows:

\begin{split}Z&=\mathbb{E}[Z]+(Z-\mathbb{E}[Z])\\&=Z_{\mathrm{mean}}+Z_{\mathrm{shape}}\ ,\end{split}

where $\mathrm{Var}(Z_{\mathrm{mean}})=0$ and $\mathbb{E}[Z_{\mathrm{shape}}]=0$.

We propose DFAC to decompose a joint return distribution $Z_{\mathrm{jt}}$ into its deterministic part $Z_{\mathrm{mean}}$ (i.e., the expected value) and its stochastic part $Z_{\mathrm{shape}}$ (i.e., the higher moments), which are approximated by two different functions $\Psi$ and $\Phi$, respectively. The factorization function $\Psi$ is required to precisely factorize the expectation of $Z_{\mathrm{jt}}$ in order to satisfy DIGM. On the other hand, the shape function $\Phi$ is allowed to roughly factorize the shape of $Z_{\mathrm{jt}}$, since the main objective of modeling the return distribution is to assist the non-linear approximation of the expectation of $Z_{\mathrm{jt}}$ (Lyle et al., 2019), rather than to accurately model the shape of $Z_{\mathrm{jt}}$.

Theorem 2 (DFAC Theorem).

Given a deterministic joint action-value function $Q_{\mathrm{jt}}$, a stochastic joint action-value function $Z_{\mathrm{jt}}$, and a factorization function $\Psi$ for deterministic utilities:

Q_{\mathrm{jt}}(\mathbf{h},\mathbf{u})=\Psi(Q_{1}(h_{1},u_{1}),...,Q_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})\,|\,s),

such that $[Q_{k}]_{k\in\mathbb{K}}$ satisfy IGM for $Q_{\mathrm{jt}}$ under $\mathbf{h}$, by Mean-Shape Decomposition, the following distributional factorization:

\begin{split}Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})&=\mathbb{E}[Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})]+(Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})-\mathbb{E}[Z_{\mathrm{jt}}(\mathbf{h},\mathbf{u})])\\&=Z_{\mathrm{mean}}(\mathbf{h},\mathbf{u})+Z_{\mathrm{shape}}(\mathbf{h},\mathbf{u})\\&=\Psi(Q_{1}(h_{1},u_{1}),...,Q_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})\,|\,s)\\&\quad+\Phi(Z_{1}(h_{1},u_{1}),...,Z_{\mathrm{K}}(h_{\mathrm{K}},u_{\mathrm{K}})\,|\,s)\end{split}

is sufficient to guarantee that $[Z_{k}]_{k\in\mathbb{K}}$ satisfy DIGM for $Z_{\mathrm{jt}}$ under $\mathbf{h}$, where $\mathrm{Var}(\Psi)=0$ and $\mathbb{E}[\Phi]=0$.

Theorem 2 reveals that the choice of $\Psi$ determines whether IGM holds, regardless of the choice of $\Phi$, as long as $\mathbb{E}[\Phi]=0$. Under this setting, any differentiable factorization function of deterministic variables can be extended to a factorization function of random variables. Such a decomposition enables the approximation of joint distributions for all factorizable tasks under appropriate choices of $\Psi$ and $\Phi$.
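The following toy numerical check (ours, with illustrative values) reflects the intuition behind Theorem 2: adding a zero-mean shape term $\Phi$ on top of the factorized mean $\Psi$ leaves the greedy action under the expected return unchanged:

```python
import numpy as np

# A toy numerical check reflecting Theorem 2: adding a zero-mean shape term Phi
# on top of the factorized mean Psi does not change the greedy action taken with
# respect to the expected return. All values below are illustrative.
rng = np.random.default_rng(1)
num_actions, num_quantile_samples = 4, 512

q_mean = rng.normal(size=num_actions)                       # Psi(Q_1, ..., Q_K | s) per joint action
shape = rng.normal(size=(num_actions, num_quantile_samples))
shape -= shape.mean(axis=1, keepdims=True)                   # enforce E[Phi] = 0

z_jt_samples = q_mean[:, None] + shape                       # samples of Z_jt = Z_mean + Z_shape
greedy_from_mean = np.argmax(q_mean)
greedy_from_distribution = np.argmax(z_jt_samples.mean(axis=1))
assert greedy_from_mean == greedy_from_distribution
```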

Figure 1: The DFAC framework consists of a factorization network $\Psi$ and a shape network $\Phi$ for decomposing the total return distribution $Z_{\mathrm{jt}}$ into the deterministic part $Z_{\mathrm{mean}}$ (i.e., $Q_{\mathrm{jt}}$) and the stochastic part $Z_{\mathrm{shape}}$, as described in Theorem 2. The shape network contains parameter networks $\Lambda_{\mathrm{state}}(s;\omega)$ and $[\Lambda_{k}(s)]_{k\in\mathbb{K}}$ for generating $Z_{\mathrm{state}}(s)$ and $\beta_{k}(s)$.

3.3 A Practical Implementation of DFAC

In this section, we provide a practical implementation of the shape function $\Phi$ in DFAC, effectively extending any differentiable factorization function $\Psi$ (e.g., the additive function of VDN, the monotonic mixing network of QMIX, etc.) that satisfies the IGM condition into its DFAC variant.

Theoretically, the sum of random variables appearing in DDN and DMIX (introduced in Section 3.4) can be described precisely by a joint CDF. However, the exact derivation of this joint CDF is usually computationally expensive and impractical (Lin et al., 2019). As a result, DFAC utilizes the property of quantile mixture to approximate the shape function $\Phi$ in $\mathrm{O}(\mathrm{K}\mathrm{N})$ time.

Theorem 3.

Given a quantile mixture:

F^{-1}(\omega)=\sum^{\mathrm{K}}_{k=1}\beta_{k}F^{-1}_{k}(\omega)

with $\mathrm{K}$ components $[F^{-1}_{k}]_{k\in\mathbb{K}}$ and non-negative model parameters $[\beta_{k}]_{k\in\mathbb{K}}$, there exists a set of random variables $Z$ and $[Z_{k}]_{k\in\mathbb{K}}$ corresponding to the quantile functions $F^{-1}$ and $[F^{-1}_{k}]_{k\in\mathbb{K}}$, respectively, with the following relationship:

Z=\sum_{k\in\mathbb{K}}\beta_{k}Z_{k}.

Based on Theorem 3, the quantile function $F^{-1}_{\mathrm{shape}}$ of $Z_{\mathrm{shape}}$ in DFAC can be approximated by the following:

\begin{split}F^{-1}_{\mathrm{shape}}(\mathbf{h},\mathbf{u}|\omega)=\ &F^{-1}_{\mathrm{state}}(s|\omega)\\+\sum_{k\in\mathbb{K}}\beta_{k}(s)\big(&F^{-1}_{k}(h_{k},u_{k}|\omega)-Q_{k}(h_{k},u_{k})\big),\end{split}  (17)

where $F^{-1}_{\mathrm{state}}(s|\omega)$ and $[\beta_{k}(s)]_{k\in\mathbb{K}}$ are respectively generated by the function approximators $\Lambda_{\mathrm{state}}(s|\omega)$ and $[\Lambda_{k}(s)]_{k\in\mathbb{K}}$, satisfying the constraints $\beta_{k}(s)\geq 0,\forall k\in\mathbb{K}$ and $\int_{0}^{1}F^{-1}_{\mathrm{state}}(s|\omega)\,\mathrm{d}\omega=0$. The term $F^{-1}_{\mathrm{state}}$ models the shape of an additional state-dependent utility (introduced by QMIX at the last layer of the mixing network), which extends the state-dependent bias in QMIX to a full distribution. The full network architecture of DFAC is illustrated in Fig. 1.

This transformation enables DFAC to decompose the quantile representation of a joint distribution into the quantile representations of individual utilities. In this work, $\Phi$ is implemented by a large IQN composed of multiple IQNs, optimized through the loss function defined in Eq. (12).
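A minimal sketch of Eq. (17) operating on batches of quantile samples is shown below; the tensor shapes and function names are our assumptions for illustration, with $F^{-1}_{\mathrm{state}}$ and $\beta_{k}(s)$ assumed to be produced by the parameter networks $\Lambda_{\mathrm{state}}$ and $\Lambda_{k}$:

```python
import torch

def f_shape(f_k, q_k, beta_k, f_state):
    """Sketch of Eq. (17). The inputs are assumed to be evaluated at a shared set
    of quantile samples omega:
      f_k:     (batch, K, Nq)  per-agent quantile values F_k^{-1}(h_k, u_k | omega)
      q_k:     (batch, K)      per-agent expected utilities Q_k(h_k, u_k)
      beta_k:  (batch, K)      non-negative weights beta_k(s) from Lambda_k(s)
      f_state: (batch, Nq)     zero-mean state quantile values F_state^{-1}(s | omega)
    Returns the joint shape quantile values F_shape^{-1}(h, u | omega): (batch, Nq)."""
    centered = f_k - q_k.unsqueeze(-1)                     # F_k^{-1} - Q_k (zero-mean per agent)
    return f_state + (beta_k.unsqueeze(-1) * centered).sum(dim=1)

# The joint return quantile values are then F_jt^{-1} = Q_jt + F_shape^{-1}, where
# Q_jt = Psi(Q_1, ..., Q_K | s) is produced by the chosen factorization network.
```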

3.4 DFAC Variant of VDN and QMIX

In order to validate the proposed DFAC framework, we next discuss the DFAC variants of two representative factorization methods: VDN and QMIX. DDN extends VDN to its DFAC variant, expressed as:

Z_{\mathrm{jt}}=\sum_{k\in\mathbb{K}}Q_{k}+\sum_{k\in\mathbb{K}}(Z_{k}-Q_{k}),\text{ given}  (18)

$Z_{\mathrm{mean}}=\sum_{k\in\mathbb{K}}Q_{k}$ and $Z_{\mathrm{shape}}=\sum_{k\in\mathbb{K}}(Z_{k}-Q_{k})$; while DMIX extends QMIX to its DFAC variant, expressed as:

Z_{\mathrm{jt}}=M(Q_{1},...,Q_{\mathrm{K}}\,|\,s)+\sum_{k\in\mathbb{K}}(Z_{k}-Q_{k}),\text{ given}  (19)

$Z_{\mathrm{mean}}=M(Q_{1},...,Q_{\mathrm{K}}\,|\,s)$ and $Z_{\mathrm{shape}}=\sum_{k\in\mathbb{K}}(Z_{k}-Q_{k})$.

Both DDN and DMIX choose $F^{-1}_{\mathrm{state}}=0$ and $[\beta_{k}=1]_{k\in\mathbb{K}}$ for simplicity. Automatically learning the values of $F^{-1}_{\mathrm{state}}$ and $[\beta_{k}]_{k\in\mathbb{K}}$ is left as future work.
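Under these choices, Eqs. (18) and (19) reduce to simple operations on quantile samples, as in the following sketch (the tensor shapes are illustrative assumptions; the monotonic mixer can be any network satisfying Eq. (3), e.g., the sketch in Section 2.3):

```python
import torch

def ddn_joint_quantiles(f_k, q_k):
    """Sketch of Eq. (18) (DDN): with beta_k = 1 and F_state^{-1} = 0, the joint
    quantile values are Z_mean + Z_shape = sum_k Q_k + sum_k (F_k^{-1} - Q_k).
      f_k: (batch, K, Nq) per-agent quantile values; q_k: (batch, K) utilities."""
    z_mean = q_k.sum(dim=1, keepdim=True)                  # (batch, 1)
    z_shape = (f_k - q_k.unsqueeze(-1)).sum(dim=1)         # (batch, Nq)
    return z_mean + z_shape                                 # (batch, Nq)

def dmix_joint_quantiles(f_k, q_k, mixer, state):
    """Sketch of Eq. (19) (DMIX): the mean part is produced by a monotonic mixing
    network M(Q_1, ..., Q_K | s); the shape part is the same as in DDN."""
    z_mean = mixer(q_k, state)                             # (batch, 1)
    z_shape = (f_k - q_k.unsqueeze(-1)).sum(dim=1)         # (batch, Nq)
    return z_mean + z_shape
```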

4 A Stochastic Two-Step Game

In the previous expected value function factorization methods (e.g., VDN, QMIX, etc.), the factorization is achieved by modeling $Q_{\mathrm{jt}}$ and $[Q_{k}]_{k\in\mathbb{K}}$ as deterministic variables, overlooking the information of higher moments in the full return distributions $Z_{\mathrm{jt}}$ and $[Z_{k}]_{k\in\mathbb{K}}$. In order to demonstrate DFAC’s ability of factorization, we begin with a toy example modified from (Rashid et al., 2018) to show that DFAC is able to approximate the true return distributions, and factorize the mean and variance of the approximated total return $Z_{\mathrm{jt}}$ into utilities $[Z_{k}]_{k\in\mathbb{K}}$. Table 1 illustrates the flow of a two-step game consisting of two agents and three states 1, 2A, and 2B, where State 1 serves as the initial state, and each agent is able to perform an action from $\{A,B\}$ in each step. In the first step (i.e., State 1), the action of agent 1 (i.e., actions $A_{1}$ or $B_{1}$) determines which of the two matrix games (State 2A or State 2B) to play in the next step, regardless of the action performed by agent 2 (i.e., actions $A_{2}$ or $B_{2}$). For all joint actions performed in the first step, no reward is provided to the agents. In the second step, both agents choose an action and receive a global reward according to the payoff matrices depicted in Table 1, where the global rewards are sampled from a normal distribution $\mathcal{N}(\mu,\sigma^{2})$ with mean $\mu$ and standard deviation $\sigma$. The hyperparameters of the two-step game are detailed in the supplementary material.

Table 2 presents the learned factorization of DMIX for each state after convergence, where the first rows and first columns of the tables correspond to the factorized distributions of the individual utilities (i.e., $Z_{1}$ and $Z_{2}$), and the main content cells correspond to the joint return distributions (i.e., $Z_{\mathrm{jt}}$). From Tables 2(b) and 2(c), it is observed that regardless of whether the true returns are deterministic (i.e., State 2A) or stochastic (i.e., State 2B), DMIX is able to properly approximate the true returns in Table 1, which is not achievable by expected value function factorization methods. The results demonstrate DFAC’s ability to factorize the joint return distribution rather than only the expected returns. DMIX’s ability to reconstruct the optimal joint policy in the two-step game further shows that DMIX can represent the same set of tasks as QMIX.

To further illustrate DFAC’s capability of factorization, Figs. 2(a)-2(b) visualize the factorization of the joint action $\langle B_{1},B_{2}\rangle$ in State 2A and $\langle B_{1},B_{2}\rangle$ in State 2B, respectively. As IQN approximates the utilities $Z_{1}$ and $Z_{2}$ implicitly, $Z_{1}$, $Z_{2}$, and $Z_{\mathrm{jt}}$ can only be plotted in terms of samples. $Z_{\mathrm{jt}}$ in Fig. 2(a) shows that DMIX degenerates to QMIX when approximating deterministic returns (i.e., $\mathcal{N}(7,0)$), while $Z_{\mathrm{jt}}$ in Fig. 2(b) exhibits DMIX’s ability to capture the uncertainty in stochastic returns (i.e., $\mathcal{N}(8,29)$).

Table 1: An illustration of the flow of the stochastic two-step game. Each agent is able to perform an action from $\{A,B\}$ in each step, with a subscript denoting the agent index. In the first step, action $A_{1}$ takes the agents from the initial State 1 to State 2A, while action $B_{1}$ takes them to State 2B instead. The transitions from State 1 to State 2A or State 2B yield zero reward. In the second step, the global rewards are sampled from the normal distributions defined in the following payoff matrices (rows: agent 1’s action; columns: agent 2’s action).

State 2A          $A_{2}$              $B_{2}$
$A_{1}$           $\mathcal{N}(7,0)$   $\mathcal{N}(7,0)$
$B_{1}$           $\mathcal{N}(7,0)$   $\mathcal{N}(7,0)$

State 2B          $A_{2}$              $B_{2}$
$A_{1}$           $\mathcal{N}(0,2)$   $\mathcal{N}(1,13)$
$B_{1}$           $\mathcal{N}(1,13)$  $\mathcal{N}(8,29)$
(a) $\langle B_{1},B_{2}\rangle$ at State 2A
(b) $\langle B_{1},B_{2}\rangle$ at State 2B
Figure 2: (a) and (b) plot the value function factorization of the joint action $\langle B_{1},B_{2}\rangle$ in State 2A and State 2B, respectively. The black curve shows the true return CDF. The blue circles and the orange cross marks depict agent 1’s and agent 2’s learned utilities, respectively, while the green squares indicate the estimated joint return.
Table 2: The learned factorization of DMIX. All of the cells show the sampled mean $\mu$ and the sampled variance $\sigma^{2}$ with Bessel’s correction. The main content cells correspond to the joint return distributions for different combinations of states and actions. The first columns and first rows of these tables correspond to the distributions of the utilities for agents 1 and 2, respectively. The top-left cells of these tables are the state-dependent utility $V$. DFAC enables the approximation of the true joint return distributions in Table 1, and allows them to be factorized into the distributions of the utilities for the agents.
State 1                           $A_{2}$               $B_{2}$
$V$:      μ=-0.32, σ²=0.00        μ=2.66, σ²=0.10       μ=2.65, σ²=0.10
$A_{1}$:  μ=2.56, σ²=0.08         μ=6.94, σ²=0.00       μ=6.92, σ²=0.00
$B_{1}$:  μ=3.58, σ²=19.11        μ=7.94, σ²=21.85      μ=7.92, σ²=21.86
(a) Learned utilities of State 1

State 2A                          $A_{2}$               $B_{2}$
$V$:      μ=0.49, σ²=0.00         μ=1.76, σ²=0.00       μ=1.75, σ²=0.00
$A_{1}$:  μ=2.09, σ²=0.00         μ=7.01, σ²=0.00       μ=6.99, σ²=0.00
$B_{1}$:  μ=2.09, σ²=0.00         μ=7.01, σ²=0.00       μ=6.99, σ²=0.00
(b) Learned utilities of State 2A

State 2B                          $A_{2}$               $B_{2}$
$V$:      μ=0.38, σ²=0.00         μ=-4.55, σ²=0.29      μ=3.08, σ²=5.87
$A_{1}$:  μ=-3.50, σ²=0.40        μ=-0.05, σ²=1.37      μ=1.01, σ²=9.30
$B_{1}$:  μ=3.52, σ²=6.81         μ=1.24, σ²=9.86       μ=8.14, σ²=25.30
(c) Learned utilities of State 2B

5 Experiment Results on SMAC

In this section, we present the experimental results and discuss their implications. We start with a brief introduction to our experimental setup in Section 5.1. Then, we demonstrate that modeling a full distribution is beneficial to the performance of independent learners in Section 5.2. Finally, we compare the performances of the CTDE baseline methods and their DFAC variants in Section 5.3.

(a) 6h_vs_8z
(b) 3s5z_vs_3s6z
(c) MMM2
(d) 27m_vs_30m
(e) corridor
Figure 3: The win rate curves evaluated on the five Super Hard maps in SMAC for different CTDE methods.
Table 3: The median win rate (%) of five independent test runs.
Map IQL VDN QMIX DIQL DDN DMIX
(a) 0.00 0.00 12.78 0.00 83.92 49.43
(b) 29.83 89.20 67.22 62.22 94.03 91.08
(c) 68.92 89.20 92.44 85.23 97.22 95.11
(d) 2.27 63.12 84.77 6.02 91.48 85.45
(e) 84.87 85.34 37.61 91.62 95.40 90.45

* Maps (a)-(e) correspond to the maps in Fig. 3.

Table 4: The averaged scores of five independent test runs.
Map IQL VDN QMIX DIQL DDN DMIX
(a) 13.78 15.41 14.37 14.94 19.40 17.14
(b) 16.54 19.75 20.16 17.52 20.94 19.70
(c) 17.50 19.36 19.42 19.21 20.90 19.87
(d) 14.01 18.45 19.41 14.45 19.71 19.43
(e) 19.42 19.47 15.07 19.68 20.00 19.66

* Maps (a)-(e) correspond to the maps in Fig. 3.

5.1 Experimental Setup

Environment. We verify the DFAC framework in the SMAC benchmark environments (Samvelyan et al., 2019) built on the popular real-time strategy game StarCraft II. Instead of playing the full game, SMAC is developed for evaluating the effectiveness of MARL micro-management algorithms. Each environment in SMAC contains two teams. One team is controlled by a decentralized MARL algorithm, with the policies of the agents conditioned on their local observation histories. The other team consists of enemy units controlled by the built-in game artificial intelligence based on carefully handcrafted heuristics, which is set to its highest difficulty level of seven. The overall objective is to maximize the win rate for each battle scenario, where the rewards employed in our experiments follow the default settings of SMAC. The default settings use shaped rewards based on the damage dealt, the enemy units killed, and whether the RL agents win the battle. If there is no healing unit in the enemy team, the maximum return of an episode (i.e., the score) is 20; otherwise, it may exceed 20, since enemies may receive more damage after healing or being healed.

The environments in SMAC are categorized into three different levels of difficulty: Easy, Hard, and Super Hard scenarios (Samvelyan et al., 2019). In this paper, we focus on all Super Hard scenarios, including (a) 6h_vs_8z, (b) 3s5z_vs_3s6z, (c) MMM2, (d) 27m_vs_30m, and (e) corridor, since these scenarios have not been properly addressed in the previous literature without the use of additional assumptions such as intrinsic reward signals (Du et al., 2019), explicit communication channels (Zhang et al., 2019; Wang et al., 2019), common knowledge shared among the agents (de Witt et al., 2019; Wang et al., 2020), and so on. Three of these scenarios have maximum scores higher than 20. In 3s5z_vs_3s6z, the enemy Stalkers have the ability to regenerate shields; in MMM2, the enemy Medivacs can heal other units; in corridor, the enemy Zerglings slowly regenerate their own health.

Hyperparameters. For all of our experimental results, the training length is set to 8M timesteps, where the agents are evaluated every 40k timesteps with 32 independent runs. The curves presented in this section are generated based on five different random seeds. The solid lines represent the median win rate, while the shaded areas correspond to the 25th to 75th percentiles. For better visualization, the presented curves are smoothed by a moving average filter with its window size set to 11. The detailed hyperparameter setups are provided in the supplementary material.

Baselines. We select IQL, VDN, and QMIX as our baseline methods, and compare them with their distributional variants in our experiments. The configurations are optimized so as to provide the best performance for each of the methods considered. Since we tuned the hyperparameters of the baselines, their performances are better than those reported in (Samvelyan et al., 2019). The hyperparameter searching process is detailed in the supplementary material.

5.2 Independent Learners

In order to validate our assumption that distributional RL is beneficial to the MARL domain, we first employ the simplest training algorithm, IQL, and extend it to its distributional variant, called DIQL. DIQL is simply a modified IQL that uses IQN as its underlying RL algorithm without any additional modification or enhancements (Matignon et al., 2007; Lyu & Amato, 2020).

From Figs. 3(a)-3(e) and Tables 3 and 4, it is observed that DIQL is superior to IQL even without utilizing any value function factorization method. This validates that distributional RL has a beneficial influence on MARL when compared to RL approaches based only on expected values.

5.3 Value Function Factorization Methods

In order to inspect the effectiveness and impacts of DFAC on learning curves, win rates, and scores, we next summarize the results of the baselines as well as their DFAC variants on the Super Hard scenarios in Figs. 3(a)-(e) and Tables 3 and 4.

Figs. 3(a)-(e) plot the learning curves of the baselines and their DFAC variants, with the final win rates presented in Table 3, and the final scores reported in Table 4. The win rates indicate how often the player’s team wins, while the scores represent how well the player’s team performs. Despite the fact that SMAC’s objective is to maximize the win rate, the true optimization goal of MARL algorithms is the averaged score. In fact, these two metrics are not always positively correlated (e.g., VDN and QMIX in 6h_vs_8z and 3s5z_vs_3s6z, and QMIX and DMIX in 3s5z_vs_3s6z).

It can be observed that the learning curves of DDN and DMIX grow faster and achieve higher final win rates than their corresponding baselines. In the most difficult map, 6h_vs_8z, most of the methods fail to learn an effective policy except for DDN and DMIX. The evaluation results also show that DDN and DMIX are capable of performing consistently well across all Super Hard maps with high win rates. In addition to the win rates, Table 4 further presents the final averaged scores achieved by each method, and provides deeper insights into the advantages of the DFAC framework by quantifying the performances of the learned policies of different methods.

The improvements in win rates and scores are due to the benefits offered by distributional RL (Lyle et al., 2019), which enables the distributional variants to work more effectively in MARL environments. Moreover, the evaluation results reveal that DDN performs especially well in most environments despite its simplicity. Further validations of DDN and DMIX on our self-designed Ultra Hard scenarios that are more difficult than Super Hard scenarios can be found in our GitHub repository (https://github.com/j3soon/dfac), along with the gameplay recording videos.

6 Conclusion

In this paper, we provided a distributional perspective on value function factorization methods, and introduced a framework, called DFAC, for integrating distributional RL with MARL domains. We first proposed DFAC based on a mean-shape decomposition procedure to ensure the Distributional IGM condition holds for all factorizable tasks. Then, we proposed the use of quantile mixture to implement the mean-shape decomposition in a computationally friendly manner. DFAC’s ability to factorize the joint return distribution into individual utility distributions was demonstrated by a toy example. In order to validate the effectiveness of DFAC, we presented experimental results performed on all Super Hard scenarios in SMAC for a number of MARL baseline methods as well as their DFAC variants. The results show that DDN and DMIX outperform VDN and QMIX. DFAC can be extended to more value function factorization methods and offers an interesting research direction for future endeavors.

7 Acknowledgements

The authors acknowledge the support from NVIDIA Corporation and NVIDIA AI Technology Center (NVAITC). The authors thank Kuan-Yu Chang for his helpful critiques of this research work. The last author would like to thank the funding support from Ministry of Science and Technology (MOST) in Taiwan under grant nos. MOST 110-2636-E-007-010 and MOST 110-2634-F-007-019.

References

  • Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 449–458, Jul. 2017.
  • Bellemare et al. (2019) Bellemare, M. G., Roux, N. L., Castro, P. S., and Moitra, S. Distributional reinforcement learning with linear function approximation. arXiv preprint arXiv:1902.03149, 2019.
  • Dabney et al. (2018a) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 1096–1105, Jul. 2018a.
  • Dabney et al. (2018b) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In Proc. AAAI Conf. on Artificial Intelligence (AAAI), pp. 2892–2901, Feb. 2018b.
  • de Witt et al. (2019) de Witt, C. S., Foerster, J., Farquhar, G., Torr, P., Böhmer, W., and Whiteson, S. Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9924–9935, 2019.
  • Du et al. (2019) Du, Y., Han, L., Fang, M., Liu, J., Dai, T., and Tao, D. Liir: Learning individual intrinsic reward in multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4405–4416, 2019.
  • Guestrin et al. (2001) Guestrin, C., Koller, D., and Parr, R. Multiagent planning with factored mdps. In NIPS, 2001.
  • Karvanen (2006) Karvanen, J. Estimation of quantile mixtures via l-moments and trimmed l-moments. Computational Statistics & Data Analysis, 51:947–959, 11 2006. doi: 10.1016/j.csda.2005.09.014.
  • Lin et al. (2019) Lin, Z., Zhao, L., Yang, D., Qin, T., Liu, T.-Y., and Yang, G. Distributional reward decomposition for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 6212–6221, 2019.
  • Lyle et al. (2019) Lyle, C., Bellemare, M. G., and Castro, P. S. A comparative analysis of expected and distributional reinforcement learning. In Proc. AAAI Conf. on Artificial Intelligence (AAAI), pp. 4504–4511, Feb. 2019.
  • Lyu & Amato (2020) Lyu, X. and Amato, C. Likelihood quantile networks for coordinating multi-agent reinforcement learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp.  798–806, 2020.
  • Matignon et al. (2007) Matignon, L., Laurent, G., and Fort-Piat, N. Hysteretic q-learning : an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. pp.  64 – 69, 12 2007. doi: 10.1109/IROS.2007.4399095.
  • Mavrin et al. (2019) Mavrin, B., Yao, H., Kong, L., Wu, K., and Yu, Y. Distributional reinforcement learning for efficient exploration. In Proc. Int. Conf. on Machine Learning (ICML), pp. 4424–4434, Jul. 2019.
  • Nikolov et al. (2019) Nikolov, N., Kirschner, J., Berkenkamp, F., and Krause, A. Information-directed exploration for deep reinforcement learning. In Proc. Int. Conf. on Learning Representations (ICLR), May 2019. URL https://openreview.net/forum?id=Byx83s09Km.
  • Oliehoek & Amato (2016) Oliehoek, F. A. and Amato, C. A Concise Introduction to Decentralized POMDPs. Springer, 2016. ISBN 3319289276.
  • Oliehoek et al. (2016) Oliehoek, F. A., Amato, C., et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016.
  • Rashid et al. (2018) Rashid, T. et al. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 4295–4304, Jul. 2018.
  • Rowland et al. (2019) Rowland, M. et al. Statistics and samples in distributional reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 5528–5536, Jul. 2019.
  • Samvelyan et al. (2019) Samvelyan, M. et al. The starcraft multi-agent challenge. In Proc. Int. Conf. on Autonomous Agents and MultiAgent Systems (AAMAS), pp.  2186–2188, May 2019.
  • Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International conference on machine learning, pp. 1312–1320, 2015.
  • Son et al. (2019) Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 5887–5896, Jul. 2019.
  • Sunehag et al. (2018) Sunehag, P. et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proc. Int. Conf. on Autonomous Agents and MultiAgent Systems (AAMAS), pp.  2085–2087, May 2018.
  • Tan (1993) Tan, M. Multi-agent reinforcement learning: Independent versus cooperative agents. In Proc. Int. Conf. on Machine Learning (ICML), pp. 330–337, Jun. 1993. ISBN 1558603077.
  • Wang et al. (2019) Wang, T., Wang, J., Zheng, C., and Zhang, C. Learning nearly decomposable value functions via communication minimization. arXiv preprint arXiv:1910.05366, 2019.
  • Wang et al. (2020) Wang, T., Dong, H., Lesser, V., and Zhang, C. Multi-agent reinforcement learning with emergent roles. arXiv preprint arXiv:2003.08039, 2020.
  • Yang et al. (2019) Yang, D. et al. Fully parameterized quantile function for distributional reinforcement learning. In Proc. Conf. Advances in Neural Information Processing Systems (NeurIPS), pp.  6190–6199, Dec. 2019.
  • Zhang & Yao (2019) Zhang, S. and Yao, H. Quota: The quantile option architecture for reinforcement learning. In Proc. AAAI Conf. on Artificial Intelligence (AAAI), pp. 5797–5804, Feb. 2019.
  • Zhang et al. (2019) Zhang, S. Q., Zhang, Q., and Lin, J. Efficient communication in multi-agent reinforcement learning via variance based control. In Advances in Neural Information Processing Systems, pp. 3230–3239, 2019.
