
Multi-Task Multi-Agent Shared Layers are Universal Cognition of Multi-Agent Coordination

Jiawei Wang
Tongji University
wangjw@tongji.edu.cn
&Jian Zhao
Polixir
jian.zhao@polixir.ai
&Zhengtao Cao
Polixir
zhengtao.cao@polixir.ai
&Ruili Feng
Alibaba Group
ruilifengustc@gmail.com
&Rongjun Qin
Nanjing University, Polixir
qinrj@lamda.nju.edu.cn
&Yang Yu*
Nanjing University, Polixir
yuy@nju.edu.cn
Abstract

Multi-agent reinforcement learning shines as the pinnacle of multi-agent systems, conquering intricate real-world challenges, fostering collaboration and coordination among agents, and unleashing the potential for intelligent decision-making across domains. However, training a multi-agent reinforcement learning network is a formidable endeavor, demanding substantial computational resources to interact with diverse environmental variables, extract state representations, and acquire decision-making knowledge. The recent breakthroughs in large-scale pre-trained models ignite our curiosity: Can we uncover shared knowledge in multi-agent reinforcement learning and leverage pre-trained models to expedite training for future tasks? Addressing this issue, we present an innovative multi-task learning approach that aims to extract and harness common decision-making knowledge, like cooperation and competition, across different tasks. Our approach involves concurrent training of multiple multi-agent tasks, with each task employing independent front-end perception layers while sharing back-end decision-making layers. This effective decoupling of state representation extraction from decision-making allows for more efficient training and better transferability. To evaluate the efficacy of our proposed approach, we conduct comprehensive experiments in two distinct environments: the StarCraft Multi-agent Challenge (SMAC) and the Google Research Football (GRF) environments. The experimental results unequivocally demonstrate the smooth transferability of the shared decision-making network to other tasks, thereby significantly reducing training costs and improving final performance. Furthermore, visualizations authenticate the presence of general multi-agent decision-making knowledge within the shared network layers, further validating the effectiveness of our approach.

Keywords: Multi-Agent Coordination · Multi-Task Learning · Common Knowledge

1 Introduction

Cooperative multi-agent reinforcement learning (MARL) has garnered significant attention due to its vast potential in solving real-world challenges, including traffic light control [1], autonomous cars [2], and robot swarm control [3]. However, training a multi-agent network can be prohibitively expensive. It requires frequent communication with complex environmental variables and consumes substantial time and resources. For instance, AlphaStar [4], OpenAI Five [5], and JueWu [6] required hundreds of thousands of cores and several months to achieve satisfactory performance. Such computational demands pose barriers to general research, limiting the full utilization of this exciting progress in modern machine intelligence by academic institutions and the broader intelligence community. Consequently, exploring methods to effectively extract and leverage shared knowledge in multi-agent decision tasks has emerged as a prominent research direction.

Inspired by the remarkable advancements of large-scale pre-trained models, a fascinating and promising question arises: Can we separate shared knowledge from the multi-agent learning framework and store it in pre-trained networks to reduce the cost of task transfer?

The primary challenge in applying pre-trained models to multi-agent reinforcement learning arises from the frequent variations in the formats of state and action spaces, even within a single task. Pre-trained models typically require standardized input and output formats, such as image pixels in computer vision or text tokens in natural language processing. Fortunately, this paper successfully identifies unique shared knowledge in multi-agent reinforcement learning. Agents often exhibit shared decision-making priors for specific tasks, such as cooperation and competition, while certain global conditions influence the decisions of all agents across diverse scenarios and tasks. However, extracting and sharing this common decision-making knowledge is not a straightforward path. In decision tasks, neural networks serve dual roles in perception and decision-making, and these functions are tightly intertwined. Separating them and distilling a rich and universal decision-making knowledge across multi-agent tasks is a challenging endeavor.

This paper introduces a pioneering multi-task learning approach that effectively tackles the aforementioned challenges and extends the power of pre-training to the realm of multi-agent reinforcement learning. By employing task-specific front-end perception modules and a shared back-end decision-making module, this approach effectively decouples the decision process, enabling the efficient disentanglement and extraction of shared decision-making knowledge. To handle varying action spaces, actions are positioned at the input end of the network, ensuring consistency in network output dimensions, which is referred to as the action prepositioning network (APN). Additionally, dynamic adaptive weights for multi-task losses are introduced to enhance multi-task pre-training efficiency.

We conduct extensive experiments in StarCraft Multi-agent Challenge (SMAC) [7] and Google Research Football (GRF) [8] environments. Our findings demonstrate the effectiveness of our approach in extracting common decision knowledge across different tasks and efficiently transferring it to new tasks, significantly reducing training costs and improving network performance. Our method significantly benefits from increasing the number of pre-training tasks, showing potential in large-scale scenarios. Furthermore, through visualization, we validate that the extracted decision networks contain universal decision knowledge for multi-agent coordination, shedding new light on future research of pre-trained models in multi-agent reinforcement learning.

2 Related Work

2.1 Pre-trained Model

A pre-trained model is a model that has already been trained on a broad range of tasks; its knowledge can be transferred to a new model to accelerate training, avoiding learning from scratch. Pre-trained models [9, 10] extract common knowledge from large amounts of data and can accelerate and improve the training of downstream tasks by reusing this knowledge in various ways. They can be applied to a wide range of downstream tasks, such as image classification [11, 12, 13, 14], object detection [15, 16, 17, 18, 19], text classification [20, 21, 22, 23, 24], and more. When training for a downstream task, we load the parameters of the pre-trained model's feature layers, keep these initial layers fixed, and fine-tune the parameters of the later layers [25], because the initial layers primarily extract generic information. For instance, in ResNet, the early layers capture low-level features such as textures and shapes.
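For concreteness, a minimal sketch of this workflow (illustrative PyTorch/torchvision code, assuming torchvision ≥ 0.13; the ResNet variant and the 10-class downstream head are placeholders, not part of our method):

import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet.
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the early feature-extraction layers, which capture generic
# low-level features such as textures and shapes.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer for a hypothetical downstream
# task with 10 classes; only this layer is fine-tuned.
model.fc = nn.Linear(model.fc.in_features, 10)

# The optimizer only receives the trainable (unfrozen) parameters.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)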

2.2 Pre-trained Model by SL

Traditionally, pre-trained models are trained through supervised learning (SL) on large datasets. They possess extensive language and visual knowledge, making them directly applicable to new tasks or amenable to fine-tuning for specific tasks. Using pre-trained models significantly reduces the time and computational resources required to build and train models from scratch, especially when data for new tasks is limited. In supervised learning, particularly in image processing, large labeled datasets such as ImageNet [26], Open Images [27], and COCO [28] are used to pre-train models such as AlexNet [29], VGG [30], and ResNet [31]. These pre-trained models learn generic features that are useful for many downstream tasks, speeding up the learning of task-specific characteristics and thus the overall training.

2.3 Pre-trained Model in SSL

Because the cost of data annotation is high, researchers have turned to self-supervised learning (SSL) to train pre-trained models. Self-supervised pre-trained models are now applied in various domains, including NLP [20, 32, 33, 34], computer vision [35, 36, 37, 38], and audio [39]. Unlike supervised learning, self-supervised pre-training does not require labeled data. There are two main approaches: one creates pretext tasks to learn feature representations from the data itself, such as image completion, text generation, image rotation prediction, and image augmentation; the other uses contrastive learning to pull similar samples closer in feature space, thereby capturing the common underlying features of similar samples.

2.4 Pre-trained Model in RL

Inspired by large-scale language modeling, DeepMind applied a similar approach to create a unified intelligent agent called Gato [40]. Gato was trained on over 600 different tasks, each with varying modalities, observations, and action specifications. However, despite integrating a variety of decision tasks, Gato still lacks generalization ability, because in the single-agent domain different decision tasks require completely different knowledge. For example, the Atari games Pong and Breakout share no knowledge between them. In contrast, multi-agent tasks, such as transferring from 3s5z to 3s5z_vs_3s6z in SMAC, share strong common knowledge: teammate attributes, enemy attributes, and strategies like focus fire and hit-and-run. In multi-agent reinforcement learning, the state space typically varies significantly across tasks, so current pre-training methods yield universal perceptual models that cannot be shared across diverse multi-agent tasks. However, different multi-agent tasks often share substantial decision knowledge, such as agent cooperation. We therefore propose a multi-task multi-agent pre-training approach aimed at acquiring pre-trained models that contain this shared decision knowledge.

3 Preliminaries

We model multi-agent reinforcement learning (MARL) as a decentralized partially observable Markov decision process (Dec-POMDP) [41]. A multi-agent Dec-POMDP can be represented as a tuple $(\mathcal{N},\mathcal{S},\mathcal{A},P,R,\mathcal{O},G,\gamma)$. Here, $\mathcal{N}=\{1,2,\cdots,N\}$ denotes the $N=|\mathcal{N}|$ agents in the multi-agent system. $\mathcal{S}$ represents the state space, $\mathcal{A}$ is the action space, and $\mathcal{O}$ corresponds to the observation space. The reward function is denoted as $R:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$. At each time step $t$, each agent $i$ performs an action $a_i\in\mathcal{A}$, collectively forming a joint action $\mathbf{a}$. After taking action $\mathbf{a}$ in state $s$, all agents receive the reward $R(s,\mathbf{a})$. $P:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]$ represents the state transition probability, which satisfies the Markov property. Each agent $i$, after taking action $a_i$ in state $s$, observes $o_i$ according to the observation function $G(s,\mathbf{a}):\mathcal{S}\times\mathcal{A}\mapsto\mathcal{O}$. $\gamma\in[0,1)$ is the discount factor. Each agent $i$ has a policy $\pi^i(a_t^i|o_{1:t}^i)$ that predicts action $a_t^i$ based on its historical observations, and $\pi(s_t,\mathbf{a}_t)$ denotes the joint policy of all agents. The ultimate goal is to learn an optimal joint policy that maximizes the discounted expected return $\mathbb{E}[\sum_{t=0}^{\infty}\gamma^{t}R(s_t,\mathbf{a}_t)]$.

Value-based MARL algorithms, such as VDN [42] and QMIX [43], utilize individual action-value functions and a joint action-value function to make decisions that maximize the cumulative reward of the entire group of agents. VDN decomposes the total action-value function, denoted $Q_{tot}(\mathbf{s},\mathbf{a})$, into a composition of distinct local action-value functions $Q_i(s_i,a_i)$:

$Q_{tot}(\mathbf{s},\mathbf{a})=\sum_{i=1}^{N}Q_{i}(s_{i},a_{i}),$ (1)

where $Q_i(s_i,a_i)$ characterizes the action-value function of agent $i$. In the training phase, the optimization of the policy is guided by $Q_{tot}$, while each agent $i$ derives its local action-value function $Q_i$ from the global $Q_{tot}$ value to make decisions. QMIX operates under the assumption that a global $\arg\max$ performed on $Q_{tot}$ is equivalent to a set of individual $\arg\max$ operations performed on each $Q_i$. To facilitate this, QMIX employs a mixing network to combine the $Q_i$ values into the $Q_{tot}$ value while satisfying:

$\frac{\partial Q_{tot}}{\partial Q_{i}}\geq 0,\ \forall i\in\{1,2,\cdots,N\}.$ (2)
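To make Eqs. (1) and (2) concrete, the following sketch contrasts a VDN-style mixer with a QMIX-style monotonic mixer (illustrative PyTorch code, not the exact architectures of [42, 43]); the non-negative mixing weights obtained via an absolute value enforce the constraint in Eq. (2):

import torch
import torch.nn as nn

class VDNMixer(nn.Module):
    """Eq. (1): Q_tot is the sum of the per-agent Q values."""
    def forward(self, agent_qs, state=None):
        # agent_qs: (batch, n_agents)
        return agent_qs.sum(dim=-1, keepdim=True)

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: hypernetworks produce non-negative weights,
    which guarantees the monotonicity constraint of Eq. (2)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(-1, 1, 1)
        return (torch.bmm(hidden, w2) + b2).squeeze(-1)  # (batch, 1)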

4 Method

Figure 1: Overall framework of the proposed multi-task multi-agent pre-training method. The 'lock' icon indicates that the parameters are fixed when using the pre-trained DecL.

MARL employs neural networks for state representation extraction and decision-making, which are tightly coupled. We train multiple multi-agent tasks simultaneously, where each task utilizes independent front-end network layers alongside shared back-end network layers. The shared back-end network layers effectively decouple state representation extraction from decision-making, thus facilitating the training of back-end network layers encompassing common decision knowledge. When addressing new multi-agent tasks, we leverage the back-end network layers containing shared decision knowledge, keeping network parameters fixed and focusing solely on improving the front-end network layers’ ability to extract state representations. Through this approach, we substantially reduce training costs. The framework of our MTAPN approach is illustrated in Fig. 1.

4.1 Multi-Task Pre-training

We refer to the front-end network layers as the Perception Layer (PerL) and the back-end network layers as the Decision Layer (DecL). During the pre-training phase, we simultaneously train $N$ multi-agent tasks. For each $task_n$, $n\in\{1,2,\cdots,N\}$, we collect the data generated by the interaction between the agents and the environment of $task_n$ and store it in the replay buffer $Buffer_n$. We then sample from $Buffer_n$ for training. Samples are fed into ${\rm PerL}_n$ to extract perceptual features, and DecL is then used to predict $action_n$. Each $task_n$ has its dedicated ${\rm PerL}_n$ while sharing DecL. Once pre-training is complete, we obtain a DecL containing the decision knowledge common to the different tasks. We fix the parameters of the pre-trained DecL and transfer it to a new task, subsequently training a PerL for that task. This approach allows us to learn effective policies for new tasks efficiently, leading to a significant reduction in training costs.
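A minimal structural sketch of this setup (illustrative PyTorch code; the module sizes, placeholder task names, and input dimensions are assumptions rather than the exact Pymarl2 implementation):

import torch
import torch.nn as nn

FEATURE_DIM = 64  # common perceptual feature dimension d shared by all tasks

class PerL(nn.Module):
    """Task-specific perception layer: maps task inputs to a d-dim feature."""
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # encodes the observation history
        self.out = nn.Linear(hidden_dim, FEATURE_DIM)

    def forward(self, x, h):
        h = self.rnn(torch.relu(self.fc(x)), h)
        return self.out(h), h

class DecL(nn.Module):
    """Shared decision layer: maps a d-dim feature to a single value estimate."""
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, f):
        return self.net(f)

# One PerL (and one replay buffer) per pre-training task, one shared DecL.
task_input_dims = {"task_1": 120, "task_3": 90, "task_5": 150}  # placeholder sizes
perls = {name: PerL(dim) for name, dim in task_input_dims.items()}
shared_decl = DecL()

# When transferring to a new task: freeze DecL, train only the new PerL.
for p in shared_decl.parameters():
    p.requires_grad = False
new_perl = PerL(input_dim=200)  # placeholder input size for the new task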

4.2 Action Prepositioning Network

Value-based MARL algorithms, such as VDN and QMIX, estimate state-action values by taking the state $s$ as the network input and outputting estimated values for every action $a_i$ in that state. The network's output size is therefore fixed to the size of the task's action space. Because action spaces vary across multi-agent tasks, sharing the back-end network layers between tasks is infeasible. To address this challenge, we move the action $a_i$ to the forefront of the network's input. The network takes state-action pairs as input and yields the corresponding value estimate as output, so the output size is fixed at $1$. By utilizing the front-end network layers to map inputs from different tasks to a consistent dimension $d$, we enable sharing of the back-end network layers across different tasks. Pairing the state with all actions as inputs to the network yields value estimates for all state-action pairs. This network architecture is referred to as the Action Prepositioning Network (APN).
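One plausible reading of this construction, as a minimal illustrative sketch (the tensor layouts, the pairing of the observation with each candidate action, and the omission of the recurrent hidden state are simplifying assumptions):

import torch

def apn_values(perl, decl, obs, prev_action, agent_id, action_onehots):
    """Score all K candidate actions of one agent, each with an output size of 1.

    obs: (obs_dim,), prev_action: (K,), agent_id: (id_dim,),
    action_onehots: (K, K) -- one row per candidate action.
    Returns a (K, 1) tensor of value estimates.
    """
    K = action_onehots.shape[0]
    # Repeat the agent's context K times and pair it with each candidate action.
    context = torch.cat([obs, prev_action, agent_id]).repeat(K, 1)   # (K, ctx_dim)
    pairs = torch.cat([context, action_onehots], dim=-1)             # (K, ctx_dim + K)
    features = perl(pairs)   # task-specific PerL -> (K, d); recurrence omitted for brevity
    return decl(features)    # shared DecL -> (K, 1)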

Based on APN, we can share DecL across different tasks during multi-task pre-training. We illustrate the training process of an arbitrary task $task_n$ using APN at time step $t$; for simplicity, we omit the $t$ subscripts of all variables. $task_n$ comprises $M$ agents, each with an action space of size $K$. $o_m$ denotes the observation of agent $m$, $a_m^{\prime}$ the action executed by agent $m$ at the previous time step, $id_m$ the identifier of agent $m$, and $a_{m,k}$ the $k$-th executable action of agent $m$, where $m\in\{1,2,\cdots,M\}$ and $k\in\{1,2,\cdots,K\}$. For each agent $m$, perceptual features $f_m$ are extracted from the task's inputs by the independent ${\rm PerL}_n$:

$f_{m}={\rm PerL}_{n}(o_{m},a_{m}^{\prime},id_{m},a_{m,1},a_{m,2},\cdots,a_{m,K}),$ (3)

where $f_m\in\mathbb{R}^{K\times d}$, with $d$ representing the dimension of the perceptual features. Given that each agent only partially observes the environment at each time step, we employ a GRU recurrent network in PerL to encode historical trajectories and address partial observability. The features $f_m$ are fed into the shared DecL to predict the value estimates $Q_m$:

$Q_{m}={\rm DecL}(f_{m}),$ (4)

where $Q_m\in\mathbb{R}^{K\times 1}$. We select the maximum Q value of each individual agent, denoted $Q_*$, and aggregate these values through a mixer to predict the total Q value $Q_{\rm predict}$. In this context, we employ the same mixer as VDN. The same data are fed into the target APN network to generate the target total Q value $Q_{\rm target}$. The APN is updated by minimizing the error between $Q_{\rm predict}$ and $Q_{\rm target}$.
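The update can be sketched as follows (illustrative code following the standard value-decomposition update, in which the Q values of the executed actions are mixed for the prediction and the target network's per-agent maxima for the target; the per-agent Q tensors are taken as given and episode-batching details are omitted):

import torch
import torch.nn.functional as F

def apn_td_loss(agent_qs, target_agent_qs, chosen_actions, rewards, dones, gamma=0.99):
    """One TD update step for APN with a VDN-style (sum) mixer.

    agent_qs:        (batch, n_agents, K)  Q values from the online PerL_n + DecL
    target_agent_qs: (batch, n_agents, K)  Q values from the target APN network
    chosen_actions:  (batch, n_agents)     actions actually executed
    rewards, dones:  (batch, 1)
    """
    # Q values of the executed joint action, mixed by summation (VDN mixer).
    chosen_q = agent_qs.gather(-1, chosen_actions.unsqueeze(-1)).squeeze(-1)  # (batch, n_agents)
    q_predict = chosen_q.sum(dim=-1, keepdim=True)                            # (batch, 1)

    # Greedy target: per-agent max Q from the target network, then summed.
    q_star = target_agent_qs.max(dim=-1).values                               # (batch, n_agents)
    q_target = (rewards + gamma * (1 - dones) * q_star.sum(dim=-1, keepdim=True)).detach()

    return F.mse_loss(q_predict, q_target)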

4.3 Dynamic Adaptive Weighting for Multi-Task Losses

During multi-task training, not all tasks converge at the same pace: simpler tasks typically converge faster than complex ones. To enhance the efficiency of multi-task training, tasks that converge more slowly should be assigned larger loss weights. We standardize evaluation metrics across the different tasks. $EM_n$ represents the normalized evaluation metric of $task_n$ and is used to assess its current convergence status. $EM_n^-$ represents the inverse evaluation metric of $EM_n$, such as its negative or reciprocal value. A smaller $EM_n$, or equivalently a larger $EM_n^-$, indicates poorer convergence of the corresponding task. We introduce a method for dynamically and adaptively assigning weights to the multi-task losses, wherein the weights are adjusted as the $EM_n^-$ change during training. Here, $loss_n$ corresponds to the loss of $task_n$, $\omega_n$ represents the weight of $loss_n$, and the total loss $total\_loss$ over all tasks is calculated as follows:

$\omega_{n}=EM_{n}^{-}\Big/\sum_{i=1}^{N}EM_{i}^{-},$ (5)
$total\_loss=\sum_{i=1}^{N}\left(\omega_{i}\times loss_{i}\right).$ (6)
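A minimal sketch of Eqs. (5)-(6) (illustrative code; here the inverse metric $EM_n^-$ is taken to be the failure rate, i.e., one minus the win rate, as used in our experiments):

def weighted_multitask_loss(task_losses, win_rates):
    """Combine per-task losses with weights proportional to each task's
    inverse evaluation metric (failure rate), following Eqs. (5)-(6).

    task_losses: dict task_name -> scalar loss tensor
    win_rates:   dict task_name -> current evaluation win rate in [0, 1]
    """
    inv_metrics = {name: 1.0 - win_rates[name] for name in task_losses}     # EM_n^-
    total_inv = sum(inv_metrics.values()) + 1e-8                            # avoid divide-by-zero
    weights = {name: em / total_inv for name, em in inv_metrics.items()}    # Eq. (5)
    return sum(weights[name] * loss for name, loss in task_losses.items())  # Eq. (6)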

5 Experiments

We conducted experiments in the StarCraft Multi-agent Challenge (SMAC) [7] and Google Research Football (GRF) [8] environments. Unless otherwise specified, both the pre-training algorithm and the downstream-task algorithm are VDN [42]. Multiple tasks are employed to pre-train the APN-based VDN network, extracting a DecL containing common knowledge. Utilizing the common knowledge in DecL, we exclusively train the PerL on new tasks. In the SMAC and GRF tasks, we employ the win rate as $EM$ and the failure rate as $EM^-$. Each set of experimental results is obtained using $5$ random seeds; the solid line shows the mean win rate, and the shaded area spans the minimum to maximum win rate over the $5$ seeds. Actions are chosen for exploration from the estimated $Q$ values using an $\epsilon$-greedy policy. Our implementations are based on the Pymarl2 algorithm library [44]. Detailed hyperparameter settings and code are provided in the supplementary material.
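For completeness, the exploration rule can be sketched as follows (standard $\epsilon$-greedy selection over the estimated Q values; the tensor layout and the optional action mask are assumptions):

import torch

def epsilon_greedy(agent_qs, epsilon, avail_actions=None):
    """Pick one action per agent: random with probability epsilon, greedy otherwise.

    agent_qs:      (n_agents, K) estimated Q values
    avail_actions: optional (n_agents, K) 0/1 mask of currently legal actions
    """
    if avail_actions is not None:
        agent_qs = agent_qs.masked_fill(avail_actions == 0, -float("inf"))
    greedy = agent_qs.argmax(dim=-1)                                     # (n_agents,)
    probs = avail_actions.float() if avail_actions is not None else torch.ones_like(agent_qs)
    random_acts = torch.multinomial(probs, num_samples=1).squeeze(-1)    # random legal action
    explore = torch.rand(agent_qs.shape[0]) < epsilon                    # per-agent coin flip
    return torch.where(explore, random_acts, greedy)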

5.1 StarCraft Multi-Agent Challenge (SMAC)

We increase the number of tasks used for pre-training to explore whether incorporating more tasks leads to better acquisition of common knowledge. All experiments are conducted on the hard and super hard difficulty levels of SMAC. The tasks involved in the experiments are detailed in Table 1.

Table 1: The SMAC maps used in the experiments and their corresponding difficulty.
Task Map Difficulty
$task_1$ 5m_vs_6m Hard
$task_2$ 8m_vs_9m Hard
$task_3$ 3s_vs_5z Hard
$task_4$ bane_vs_bane Hard
$task_5$ 2c_vs_64zg Hard
$task_6$ corridor Super Hard
$task_7$ MMM2 Super Hard
$task_8$ 3s5z_vs_3s6z Super Hard
$task_9$ 27m_vs_30m Super Hard
$task_{10}$ 6h_vs_8z Super Hard

Three-task pre-training. We employ two groups of tasks for pre-training DecL: $Group_1$: ($task_1$, $task_3$, $task_5$) and $Group_2$: ($task_7$, $task_8$, $task_9$). All tasks within $Group_1$ are of hard difficulty, while those in $Group_2$ are of super hard difficulty. The DecL pre-trained with $Group_1$ is transferred, and PerL is separately trained on $task_7$ and $task_{10}$. Similarly, the DecL pre-trained with $Group_2$ is transferred, and PerL is separately trained on $task_3$ and $task_4$. Experimental results are depicted in Fig. 2.

Figure 2: Win rates of DecL pre-training using three tasks and of transferring it to new SMAC tasks. (a) pre-training of $Group_1$ tasks; (b) from $Group_1$ tasks to $task_7$; (c) from $Group_1$ tasks to $task_{10}$; (d) pre-training of $Group_2$ tasks; (e) from $Group_2$ tasks to $task_3$; (f) from $Group_2$ tasks to $task_4$.

From Fig. 2 (a) and (d), it is evident that the APN network effectively learns a strategy for each task during multi-task training. The DecL obtained from multi-task pre-training thus encompasses the decision knowledge utilized in all tasks, namely generic decision knowledge. Fig. 2 (b) shows that using the DecL pre-trained on three hard tasks results in faster convergence on $task_7$ compared to training from scratch. Fig. 2 (c) demonstrates that, with the DecL pre-trained on three hard tasks, $task_{10}$ achieves superior convergence compared to starting from scratch. Fig. 2 (e) shows that utilizing the DecL pre-trained on three super hard tasks leads to faster convergence and better final results on $task_3$ compared to starting from scratch. However, Fig. 2 (f) reveals that, despite pre-training on three super hard tasks, $task_4$ exhibits convergence results and rates consistent with training from scratch, indicating no improvement. Clearly, in comparison to a DecL pre-trained on a single task, a DecL pre-trained on three tasks transfers better to new tasks. This highlights the advantage of more pre-training tasks in obtaining a DecL with more comprehensive generic decision knowledge. The lack of improvement on $task_4$ also suggests that the decision knowledge contained within DecL can only cover some tasks comprehensively, prompting further consideration of increasing the number of pre-training tasks.

Six-task pre-training. We select six tasks for pre-training DecL: $Group_3$: ($task_1$, $task_2$, $task_5$, $task_6$, $task_7$, $task_8$). We fix the pre-trained DecL parameters and transfer them separately to $task_4$ and $task_{10}$. The experimental results are depicted in Fig. 3.

Figure 3: Win rates of $Group_3$ pre-training, and of using the $Group_3$ pre-trained DecL to train $task_4$ and $task_{10}$. (a) pre-training of $Group_3$ tasks; (b) from $Group_3$ tasks to $task_4$; (c) from $Group_3$ tasks to $task_{10}$.

Fig. 3 (a) demonstrates that our proposed method continues to acquire effective policies across all tasks when the number of pre-training tasks increases to six. In Fig. 3 (b), we observe that training $task_4$ with the DecL pre-trained on $Group_3$ tasks results in faster convergence and superior final performance compared to training from scratch. Furthermore, employing the DecL pre-trained on $Group_3$ tasks for $task_4$ converges more rapidly than using the DecL pre-trained on $Group_2$ tasks. The same result can be found in Fig. 3 (c): training $task_{10}$ with the DecL pre-trained on $Group_3$ leads to faster convergence and improved outcomes compared to using the $Group_1$ pre-trained DecL. In summary, through multi-task pre-training, we can extract a DecL containing common decision knowledge, which becomes more general as the number of pre-training tasks increases.

VDN to QMIX. To further validate that the DecL pre-trained with the VDN algorithm contains common decision knowledge, we transfer the DecL pre-trained on the three super hard difficulty ($Group_2$) and three hard difficulty ($Group_1$) SMAC tasks to $task_3$ and $task_7$, respectively. We then employ the QMIX algorithm to train PerL on these new tasks. Experimental results are presented in Fig. 4.

Figure 4: Win rates of using QMIX to train new tasks with a DecL pre-trained on multiple tasks based on APN-VDN. (a) from $Group_2$ tasks to $task_3$; (b) from $Group_1$ tasks to $task_7$.

Fig. 4 demonstrates that when the DecL pre-trained on three tasks using the VDN algorithm is transferred to new tasks and trained with the QMIX algorithm, the final convergence results are superior to those of the QMIX algorithm trained from scratch. Consequently, the DecL obtained from multi-task pre-training indeed encompasses common decision knowledge. This common decision knowledge is applicable to various new tasks as well as new multi-agent reinforcement learning algorithms.

5.2 Ablation Study

APN-VDN vs VDN. To enable multi-task training, we place the action at the front of the network input and maintain this structure when transferring to new tasks. To assess the impact of this structure on performance, we train the original VDN and APN-VDN on all SMAC tasks; the results are presented in Fig. 5 and Table 2.

Figure 5: Win rates of training SMAC tasks using the original VDN and APN-VDN. (a) $task_1$; (b) $task_6$; (c) $task_9$.
Table 2: Convergence results of the original VDN and APN-VDN on all SMAC tasks. Bold indicates the better result; the value following "±" is the variance of the results over multiple random seeds.
Map Difficulty original VDN APN-VDN
5m_vs_6m Hard 80.38±0.06 79.31±0.06
8m_vs_9m Hard 89.37±0.08 86.55±0.05
3s_vs_5z Hard 94.93±0.11 98.05±0.01
bane_vs_bane Hard 97.16±0.02 99.97±0.00
2c_vs_64zg Hard 97.24±0.01 96.11±0.05
corridor Super Hard 90.74±0.09 92.45±0.03
MMM2 Super Hard 96.27±0.01 96.99±0.01
3s5z_vs_3s6z Super Hard 73.65±0.17 62.11±0.17
27m_vs_30m Super Hard 95.99±0.01 97.71±0.00
6h_vs_8z Super Hard 1.08±0.01 1.83±0.01

As shown in Fig. 5, adopting the APN structure has a slight impact on the convergence speed of the original VDN but does not affect the ultimate convergence results. Table 2 demonstrates that APN-VDN and original VDN achieve nearly identical convergence results across all SMAC tasks. Therefore, employing the APN structure does not influence the performance of multi-agent algorithms. Consequently, we use APN-VDN as the baseline algorithm for comparison in our experiments, with variations in algorithm performance stemming from the utilization of pre-trained DecL.

Single-task Pre-training. We conduct experiments on $task_1$, $task_2$, and $task_9$. We perform pre-training on $task_2$ and $task_9$ to obtain DecLs with task-specific decision knowledge. Subsequently, we freeze the parameters of DecL and transfer them to $task_1$ for retraining PerL. Experimental results are illustrated in Fig. 6.

Figure 6: Win rates of training a new SMAC task using single-task pre-trained DecLs. (a) from $task_2$ to $task_1$; (b) from $task_9$ to $task_1$.

Fig. 6 reveals that when using the DecL pre-trained on $task_2$ or $task_9$, the final convergence outcomes for $task_1$ are consistently worse than training from scratch. This occurs because pre-training DecL on a single task leads to overfitting on that specific task, preventing the acquisition of general decision knowledge. Consequently, when we transfer a DecL pre-trained on a single task to a new task, it fails to align with the new task, degrading the final convergence performance. Therefore, more pre-training tasks are needed to extract a DecL containing general decision-making knowledge.

Fix vs Fine-tune. When employing pre-trained models, we typically face a choice between fixing the weights and fine-tuning them. Our experiments explore both scenarios when transferring the DecL weights pre-trained on $Group_1$ tasks to $task_8$. The experimental results, depicted in Fig. 7, show that fixing the pre-trained DecL weights leads to efficient convergence to superior results. During backpropagation, the gradient signal attenuates as it propagates from the network's output toward its input; since DecL is closer to the output, its parameters would receive larger updates than PerL's. In the early stages of training, PerL has not yet developed robust perceptual capabilities, so the resulting large losses would disrupt the pre-trained DecL, slowing convergence and degrading results. Thus, we consistently opt for the fixed-weight approach when employing the pre-trained DecL.
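The two settings differ only in which parameters the optimizer is allowed to update; a minimal sketch (assuming PerL/DecL modules like those sketched in Section 4.1):

import torch

def build_optimizer(perl, decl, fix_decl=True, lr=5e-4):
    """'Fix': optimize only PerL, keeping the pre-trained DecL frozen.
    'Fine-tune': optimize both PerL and the pre-trained DecL."""
    for p in decl.parameters():
        p.requires_grad = not fix_decl
    params = list(perl.parameters()) + ([] if fix_decl else list(decl.parameters()))
    return torch.optim.Adam(params, lr=lr)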

Figure 7: Win rates of training $task_8$ when using a fixed pre-trained DecL, fine-tuning the pre-trained DecL, and not using the pre-trained DecL. The pre-trained DecL is obtained by pre-training on the $Group_1$ tasks.
Figure 8: Visualization of model performance on different tasks with and without the pre-trained decision module at the same training steps. (a) $task_2$ without pre-trained DecL at 6M steps; (b) $task_7$ without pre-trained DecL at 10M steps; (c) $task_4$ without pre-trained DecL at 0.6M steps; (d) $task_2$ with pre-trained DecL at 6M steps; (e) $task_7$ with pre-trained DecL at 10M steps; (f) $task_4$ with pre-trained DecL at 0.6M steps.

5.3 Visualization

To better illustrate how our proposed multi-task multi-agent reinforcement learning algorithm learns to share common decision knowledge between different tasks, we visualize the acquired policies in the SMAC tasks. The videos of the battle are provided in the supplementary material.

In Fig. 8 (a) and (d), we visualize the performance on $task_2$ after training for 6 million steps without and with the DecL pre-trained on $Group_2$ tasks. In Fig. 8 (a), the attack positions of all Marines remain nearly unchanged, displaying a lack of flexibility even when their health is low. In Fig. 8 (d), however, the Marines exhibit a more flexible policy, moving to disperse enemy fire and concentrating their attacks on a portion of the enemy forces.

Fig. 8 (b) and (e) visualize the performance on $task_7$ after training for 10 million steps without the pre-trained DecL and with the DecL pre-trained on $Group_1$ tasks, respectively. In Fig. 8 (b), the Medivac does not employ an evasive strategy and is vulnerable to enemy attacks, so our Medivac is lost and the health of the other agents cannot be recovered in time. Additionally, the positioning of the Marines and Marauders lacks strategic depth, and they fail to retreat in time when their health is low. In Fig. 8 (e), however, the agents have acquired advanced policies: Marines with low health retreat promptly, while healthier Marines draw the enemy Marauders' fire, allowing other Marines to stealthily flank the enemy Marauders for surprise attacks.

In Fig. 8 (c) and (f), we visualize the performance on $task_4$ after training for 0.6 million steps without and with the DecL pre-trained on $Group_3$ tasks. In Fig. 8 (c), all Banelings charge toward the opponent to detonate, but some Zerglings also advance and are damaged by the Banelings' explosions. In Fig. 8 (f), however, all Banelings immediately rush toward the opposing camp while the Zerglings quickly retreat to avoid collateral damage from the Banelings, resulting in minimal casualties.

In summary, by utilizing a pre-trained DecL incorporating generic decision knowledge to train for new tasks, we typically acquire more effective policies within the same training steps, thus achieving superior results more efficiently.

5.4 Google Research Football (GRF)

We conduct experiments on six GRF tasks, as shown in Table 3. We choose four tasks spanning easy and hard difficulties, ($task_{11}$, $task_{13}$, $task_{14}$, $task_{15}$), for pre-training the DecL. Subsequently, we transfer the pre-trained DecL to $task_{12}$ of easy difficulty and $task_{16}$ of super hard difficulty. The experimental outcomes are depicted in Fig. 9.

Figure 9: Win rates of pre-training DecL using four GRF tasks and of transferring the pre-trained DecL to a new GRF task. (a) pre-training of the four GRF tasks; (b) from the four tasks to $task_{12}$; (c) from the four tasks to $task_{16}$.
Table 3: The GRF maps used in the experiments and their corresponding difficulty.
Task Map Difficulty
$task_{11}$ Run to Score with Keeper Easy
$task_{12}$ Pass and Shoot with Keeper Easy
$task_{13}$ Run, Pass and Shoot with Keeper Easy
$task_{14}$ 3 versus 1 with Keeper Hard
$task_{15}$ Easy Counter-Attack Hard
$task_{16}$ Hard Counter-Attack Super Hard

Fig. 9 (a) illustrates that the APN network, during multi-task pre-training, effectively learns policies across the GRF tasks, each involving a different number of agents. Fig. 9 (b) demonstrates that with the pre-trained DecL, the convergence speed and final convergence results for $task_{12}$ of easy difficulty significantly improve compared to training from scratch. Fig. 9 (c) reveals that, with the pre-trained DecL, the final convergence results for $task_{16}$ of super hard difficulty are superior to those obtained from training from scratch. The use of the pre-trained DecL makes training on new GRF tasks more efficient; notably, $task_{16}$ fails to acquire an effective policy when trained from scratch. In summary, through multi-task pre-training, we obtain a DecL containing common decision knowledge, which enables us to acquire policies for new tasks more efficiently and substantially reduces training costs.

6 Conclusion

In this paper, we propose a multi-task multi-agent reinforcement learning algorithm that obtains a shared decision module through multi-task pre-training. To ensure a consistent format for shared knowledge across tasks with different action spaces, we introduce the action prepositioning network (APN). By relocating actions from the output of the decision-making module to the input of the perception module, we maintain a consistent output dimension of the decision-making module across agents and tasks. Additionally, we address the disparate training speeds of different tasks during multi-task training by dynamically adjusting task weights, which accelerates pre-training.

Through extensive experiments, we demonstrate that the pre-trained shared decision-making module contains general multi-agent decision-making knowledge. Moreover, as we increase the number of tasks used for pre-training, we observe that the shared decision layer contains more general decision knowledge, further improving the training efficiency for new tasks. By leveraging a decision module that incorporates this general decision knowledge, we significantly reduce the training cost for new tasks.

In future work, we plan to expand the number of pre-training tasks and train a larger decision-making module that encompasses a broader spectrum of general decision-making knowledge. Additionally, we aim to explore the transferability of decision-making knowledge across different domains, such as applying knowledge from StarCraft to tasks in Dota or Honor of Kings. Given the existence of common decision-making knowledge between these tasks, we strongly believe this idea holds considerable promise for future research.

References

  • [1] Tong Wu, Pan Zhou, Kai Liu, Yali Yuan, Xiumin Wang, Huawei Huang, and Dapeng Oliver Wu. Multi-agent deep reinforcement learning for urban traffic light control in vehicular networks. IEEE Transactions on Vehicular Technology, 69(8):8243–8256, 2020.
  • [2] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2012.
  • [3] Maximilian Hüttenrauch, Adrian Šošić, and Gerhard Neumann. Guided deep reinforcement learning for swarm systems. arXiv preprint arXiv:1709.06011, 2017.
  • [4] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • [5] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. 2019.
  • [6] Deheng Ye, Guibin Chen, Wen Zhang, Sheng Chen, Bo Yuan, Bo Liu, Jia Chen, Zhao Liu, Fuhao Qiu, Hongsheng Yu, et al. Towards playing full moba games with deep reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 621–632, 2020.
  • [7] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019.
  • [8] Karol Kurach, Anton Raichuk, Piotr Stańczyk, Michał Zając, Olivier Bachem, Lasse Espeholt, Carlos Riquelme, Damien Vincent, Marcin Michalski, Olivier Bousquet, et al. Google research football: A novel reinforcement learning environment. In AAAI Conference on Artificial Intelligence (AAAI), volume 34, pages 4501–4510, 2020.
  • [9] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):1–40, 2016.
  • [10] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
  • [11] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • [12] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–1607. PMLR, 2020.
  • [14] Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.
  • [15] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
  • [16] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), pages 213–229. Springer, 2020.
  • [17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (NeurIPS), 28, 2015.
  • [18] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10781–10790, 2020.
  • [19] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision (IJCV), 129:3069–3087, 2021.
  • [20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [21] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. Ernie: Enhanced language representation with informative entities. arXiv preprint arXiv:1905.07129, 2019.
  • [22] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
  • [23] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
  • [24] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
  • [25] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems (NeurIPS), 27, 2014.
  • [26] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. Ieee, 2009.
  • [27] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2(3):18, 2017.
  • [28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.
  • [29] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS), 25, 2012.
  • [30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [32] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • [33] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [34] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
  • [35] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision (ECCV), pages 319–345. Springer, 2020.
  • [36] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
  • [37] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9729–9738, 2020.
  • [38] Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pages 10268–10278. PMLR, 2021.
  • [39] Janne Spijkervet and John Ashley Burgoyne. Contrastive learning of musical representations. arXiv preprint arXiv:2103.09410, 2021.
  • [40] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
  • [41] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
  • [42] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
  • [43] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research (JMLR), 21(1):7234–7284, 2020.
  • [44] Jian Hu, Siyang Jiang, Seth Austin Harding, Haibin Wu, and Shih-wei Liao. Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2102.03479, 2021.