BCD Reward: Behavior and Context-Driven Reward for
Deep Reinforcement Learning in Human-AI Coordination

Author Name Affiliation email@example.com

Abstract

Deep reinforcement Learning (DRL) offers a powerful framework for AI agents to adapt to human partners by interpreting and anticipating their actions. However, challenges like sparse rewards and intricate human behavior hinder the effective exploration and exploitation of existing RL algorithms in human-AI coordination scenarios. To address these challenges, we propose an innovative behavior and context-driven (BCD) reward-enhanced DRL algorithm, which significantly increases the sparser rewards by utilizing the hidden features of the human coordinators’ behaviors and the training context. Specifically, we improve the exploration by designing novel intrinsic rewards to interpret human behaviors by observing counterfactual human actions, whilst intensifying the exploration of low-probability state-action pairs associated with sparse rewards. We enhance the exploitation by developing a novel context-aware weighting mechanism that alleviates the potential over-exploration caused by our intrinsic rewards, enabling the AI agent to prioritize actions that foster better coordination with the human partner. Extensive simulations in the Overcooked environment demonstrate that our approach increases the averaged sum of sparse rewards per epoch by approximately $20\%$ and reduces the epochs required for convergence by about $67\%$ compared to state-of-the-art baselines.

1 Introduction

Human-AI coordination is increasingly pivotal in addressing complex tasks that require the synergy of human intuition and machine autonomy for a shared goal AAAI_overcooked_Tsinghua; PantheonRL_aaai2022. Achieving effective coordination for enhancing efficiency and performance of the shared task necessitates an AI agent that can seamlessly adapt to the human partner by interpreting and anticipating human actions. Reinforcement Learning (RL) offers a powerful framework for developing such an adaptive agent due to its ability to learn the optimal policy through interactions. However, leveraging RL in human-AI coordination presents significant challenges.

Design a good reward is the key to achieving good performance. Existing work has done a lot, but … In human-AI coordination, rewards are typically granted only when a task is successfully completed, and a task often requires multiple steps, all of which need to be performed in a coordinated manner with the human coordinator. The overarching goal of RL is to maximize the sum of these sparse and delayed rewards within a time limit. One primary issue is that sparse and delayed rewards in RL fail to capture the nuances of human behavior, hindering the AI agent’s ability to align its actions with those of the human coordinator. Capturing subtle cues and adapting to the human partner’s strategies are essential for seamless coordination. In addition, sparse rewards increase the temporal gap between the reinforcement and the joint humans-AI actions. It is difficult for the agent to learn from individual interactions that are crucial for effective coordination, as there’s no immediate feedback on the utility of individual actions to determine which specific actions led to the successful completion of a coordination task.

The design/balance of exploration and exploitation is the key to designing the good/expected reward for our problem/scenario. There are issues with using existing methods.

Moreover, the state-action pairs associated with low-probability sparse rewards are critical for effective learning but are rarely encountered during training due to the asymmetric behaviors of humans and AI agents as well as the complex dynamics of human-AI coordination combine_human_aaai_2019—especially when the AI agent cannot fully interpret the intricate dynamics of human behavior multiple_feedback_aaai_2023; balance_human_aaai_2020. These rare but significant interactions often represent key moments where the AI agent’s actions align perfectly with the human partner’s intentions, leading to successful task completion. Without sufficient exposure to these critical state-action pairs, the AI agent struggles to understand which actions are beneficial, resulting in slow learning and suboptimal performance. A common approach to mitigating the issue is to augment exploration to improve the acquisition of sparse rewards at the expense of exploitation ICML2019_Social_MARL_GitHub; max_entropy_reward. However, this leads to low training efficiency NIPS2017_MAAC; Cooperative_IRL; ex_ex_self_supervised, due to the ingrained trade-off between exploration and exploitation in RL Sutton_FlowerBook; classical_A3C; ex_ex_MARL.

To address these challenges that hinder effective exploration and policy optimization during learning, we propose a novel design of behavior and context-driven (BCD) rewards for the RL-based AI agent. The novel contributions of this work are summarized as follows:

•

We develop human-aware intrinsic rewards that enable the AI agent to better understand and anticipate human partner intentions by considering possible alternative human actions under various circumstances.
•

We enhance the agent’s exploration by integrating a logarithmic term into the intrinsic rewards, encouraging diverse action selection and increasing the likelihood of encountering critical yet rare state-action pairs associated with sparse rewards.
•

We introduce a context-aware weighting mechanism that dynamically adjusts the balance between intrinsic and extrinsic rewards based on coordination effectiveness across task performance, human behavior, and agent diversity, thereby optimizing exploitation and improving training efficiency.

By integrating intrinsic rewards that reflect human behavior and encourage the exploration of critical state-action pairs, along with a context-aware weighting mechanism, experimental results demonstrate that our approach outperforms state-of-the-art algorithms and enhances the AI agent’s ability to learn and adapt in the settings of coordination. This leads to increased performance and a significant reduction in training time, illustrating an innovative path for effective human-AI coordination.

2 Related Work

Existing work mainly focuses on designing rewards for improving the effectiveness in two aspects: exploration and exploitation.

2.0.1 Exploration Improvement

Intrinsic motivation.

Unlike traditional extrinsic rewards directly obtained from the environment, intrinsic rewards are usually derived from AI agents’ observations and can increase the likelihood of identifying sparse state-action pairs in complicated scenarios/environments. Typical intrinsic rewards, such as maximum entropy max_entropy_reward and causal influence ICML2019_Social_MARL_GitHub, enhance exploration by focusing on the actions of the AI agent and its coordinators, respectively. However, this effectiveness achieved by using intrinsic rewards has not been validated/applied/transferred in human-AI coordination scenarios with sparse rewards. But this is not human-AI coordination.

More recent work AAAI_overcooked_Tsinghua proposed to train the AI agent to adapt the human without using human data, by assuming the human is super diverse. However, this approach makes the AI agent super adaptive, but it sacrifices the final performance. This is because the no free lunch theorems NoFreeLunch suggest that exploiting a specific subclass of problems is necessary for improved performance bySGD_bySGD; Xin_HML; Xin_ICC_GNN. To bridge the gap between sparse external rewards and the nuanced human behaviors critical for coordination, we introduce innovative intrinsic rewards design. Specifically, the proposed intrinsic rewards consider human behavior by observing counterfactual human actions—the actions that the human partner could have taken under different circumstances. By incorporating these insights, the AI agent gains a deeper understanding of the human collaborator’s intentions and preferences.

Regularization.

In RL from human feedback (RLHF) related works Christiano_NIPS2017_RLHF; Stiennon_RLHF_NIPS2020, researchers have identified to use of the logarithmic term of the equation of the intrinsic reward, also known as a regularization term. Using only the logarithmic term can significantly increase the effects of the intrinsic reward, thereby increasing the exploration of the low-probability events, such as sparse state-action pairs leading to targeted sparse reward. However, with the training process, the sparse state-action pairs leading to targeted sparse reward are becoming a higher probability event. So, always using the log term may lead to some disasters.

2.0.2 Exploitation Improvement

Task-aware weights.

Inspired by adaptive weighting in multi-task deep learning, which modifies task importance based on loss changes CVPR2019_MultiTask_attention; Transformer, we propose a novel approach that incorporates context-aware weights into the reward structure to enhance the training efficiency of AI agents. However, if we apply this technique from multi-task attention-based DL, in the late/exploitation stage of training, the extrinsic reward is stable, with minimum changes. Thus, we would have significant exploitation of the intrinsic reward; the bad thing is it will lead to instability in the late/exploitation stage of training.

Reward Shaping???

We plan to exploit the sparse reward after the early training stages, because only the sparse reward is our ultimate goal. In the latter training stage, exploiting the other rewards will distract the AI agent’s attention from achieving the ultimate goal.

3 Background of Human-AI Coordination

One of the primary objectives in human-AI coordination is to achieve the shared goal through seamless coordination between AI agents and humans combine_human_aaai_2019; balance_human_aaai_2020. We aim to improve the effectiveness and efficiency of human-AI coordination by enhancing the AI agent’s ability to adapt to human behaviors and dynamic environments.

Benchmark environment. Overcooked environment is a widely recognized benchmark for validating human-AI coordination NeurIPS2019_Overcooked_GitHub. It presents complicated cooking tasks accomplished by the coordination of human and AI agents within a limited timesteps. For example, the first layout of the Overcooked environment, Cramped Room, human and AI agents work together in a shared confined space. This layout poses challenges to coordination as it often leads to collisions between human and AI agents. In contrast, the second layout, Asymmetric Advantages, necessitates high-level strategic planning as agents begin in distinct areas with different accessibilities of the cooking resources/subtasks.

Human coordinator. In our experiments, we evaluate our approach designed for optimizing the AI agent’s policy with the human model commonly used in human-aligned RL research AAAI_overcooked_Tsinghua; NeurIPS2019_Overcooked_GitHub; Stiennon_RLHF_NIPS2020; Christiano_NIPS2017_RLHF. The human model is pre-trained from extensive human behavior data, which offers a scalable and practical alternative to real-time human interaction. By rigorously testing in the Overcooked human-AI coordination environment, we gain insights into the challenges and solutions for achieving effective and efficient human-AI coordination in real-world scenarios.

AI agent The AI agent …

Refer to caption — Figure 1: The BCD reward-enhanced DRL for human-AI coordination. The AI agent is trained using an RL algorithm, which integrates extrinsic rewards, intrinsic rewards, and context-aware weights. Extrinsic rewards are derived from the environment, while intrinsic rewards are based on the behaviors of both the human and AI agents. Context-aware weights are adjusted according to the episode return—across three domains task performance, human behavior, and AI agent behavior.

4 BCD Reward-driven DRL

In this section, we first provide an overview of the BCD reward-driven DRL (BCD-DRL). Subsequently, we clarify six key definitions integral to IReCa’s detailed design. Finally, we present the design specifics of the BCD Reward.

4.1 BCD-DRL Overview

As illustrated in Fig. 1, the core concept of our approach lies in the design of the BCD reward, which is composed of three key elements: (1) standard extrinsic rewards derived from interactions with the environment, (2) innovative intrinsic rewards based on the actions of both AI and human agents, and (3) context-aware weights that automatically balance exploration and exploitation by dynamically adjusting the importance of extrinsic and intrinsic rewards. Using the BCD reward, the AI agent leverages an RL algorithm, such as proximal policy optimization (PPO), to interact and collaborate with human partners. We follow the methodology outlined inNeurIPS2019_Overcooked_GitHub, employing a human model pre-trained using a behavior cloning algorithm on human data. The use of an RL algorithm is essential, as it allows the AI agent to learn from its interactions with both the human coordinator and the environment.

Let $t$ denote the index of timesteps. A timestep is a unit of measurement used to denote the progression of time within the whole learning process of an agent interacting with its environment. The agent performs an interaction with the environment in each timestep. Let $n$ denote the index of epochs. The duration of each epoch is a predefined number $T$ of timesteps. The policy of the AI agent is updated after each epoch, using the data corresponding to that epoch. Then, the BCD reward of the AI agent is given by

r_{t}=\kappa_{n}^{\mathcal{E}}r_{t}^{\mathrm{E}}+\kappa_{n}^{\mathrm{A}}r_{t}^{\mathrm{A}}+\kappa_{n}^{\mathrm{H}}r_{t}^{\mathrm{H}},

(1)

where $r_{t}^{\mathrm{E}}$ is the extrinsic reward obtained from the environment in the $t$ -th timestep, $r_{t}^{\mathrm{A}}$ and $r_{t}^{\mathrm{H}}$ are a pair of intrinsic rewards in the $t$ -th timestep, which encourages the AI agent to explore the environment comprehensively from the distinct behaviors of AI and its human coordinator. The $\kappa_{n}^{\mathcal{E}}$ , $\kappa_{n}^{\mathrm{A}}$ , and $\kappa_{n}^{\mathrm{H}}$ are the context-aware weights in the $n$ -th epoch, where $n=\lfloor t/T\rfloor$ . These weights change epoch by epoch. We note that the use of distinguishable superscripts for the extrinsic reward and its corresponding context-aware weight is intentional to emphasize their distinct design rationale, which will be elaborated in the following subsections.

4.2 BCD Reward Design

4.2.1 Extrinsic Reward

Achieving the desired sparse rewards in complicated human-AI coordination scenarios often requires leveraging the intermediate-stage rewards to guide the exploration and exploitation of key preliminary actions that lead to the target sparse rewards reward_shaping_WED; reward_shaping_nips2020. For example, in the Overcooked environment, placing an onion in the pot is a crucial preliminary action for earning the sparse reward associated with cooking onion soup. Stage rewards are typically present during the early stages of training but tend to diminish as training progresses. We follow the design outlined in NeurIPS2019_Overcooked_GitHub, in which the extrinsic reward is composed of the target sparse reward, $r_{t}^{\mathrm{E}_{\mathrm{S}}}$ , and a linearly fading stage reward, $r_{t}^{\mathrm{E}_{\mathrm{G}}}$ , and is given by:

\begin{split}r_{t}^{\mathrm{E}}&=r_{t}^{\mathrm{E}_{\mathrm{S}}}+\max\left\{0,1-\lambda^{\mathrm{E}_{\mathrm{G}}}\cdot t\right\}r_{t}^{\mathrm{E}_{\mathrm{G}}},\end{split}

(2)

where $\max\{\cdot\}$ is the maximum function, and $\lambda^{\mathrm{E}_{\mathrm{G}}}$ is a constant coefficient of the extrinsic stage reward that controls the fading speed of the stage reward. Both the sparse reward and the stage reward are derived from the environment and represent the joint contributions of both the AI and human agents to the task. It seems that we need to give the expression of $r_{t}^{\mathrm{E}_{\mathrm{S}}}$ and $r_{t}^{\mathrm{E}_{\mathrm{G}}}$ . For example,

r_{t}^{\mathrm{E}_{\mathrm{S}}}=\begin{cases}???,&\text{if shared goal is achieved}\\ 0,&\text{otherwise}\end{cases},\forall t>0.

(3)

$r_{t}^{\mathrm{E}_{\mathrm{G}}}$ changes over time? or it should be a constant like $r^{\mathrm{E}_{\mathrm{G}}}$ and fade by multiplying $\max\left\{0,1-\lambda^{\mathrm{E}_{\mathrm{G}}}\cdot t\right\}$ .

4.2.2 Intrinsic Reward

While stage rewards can assist in this process, they also pose the risk of leading the learning process into local optima due to excessive exploitation. To counteract this limitation, intrinsic rewards serve as a powerful mechanism to enhance the exploration capabilities of AI agents intrinsic_reward_aaai; causal_MARL_aaai.

In human-AI coordination scenarios with sparse rewards, to effectively collaborate with human coordinators who behave diversely and sparsely, a focus on the critical state-action pairs for obtaining those sparse rewards is necessary. Drawing inspiration from the approach in Stiennon_RLHF_NIPS2020, which introduced a regularization term that improved upon state-of-the-art reward functions, we propose a novel method. Traditional regularization terms such as entropy, KL divergence, and cross-entropy often underestimate the impact of these sparse state-action pairs due to their low probability of occurrence. To address this, we introduce a logarithmic term to emphasize the significance of these low-probability pairs, thereby enhancing the learning process.

Our intrinsic reward consists of two components: the self-motivated intrinsic reward and the human coordinator-motivated intrinsic reward.

a) Self-motivated intrinsic reward.

The self-motivated intrinsic reward is designed to encourage the AI agent to adopt a more diverse policy, and it is defined as

\begin{split}r_{t}^{\mathrm{A}}&=\lambda^{\mathrm{A}}\cdot\mathbb{E}_{\pi}\left[-\log\left(\pi(a_{t}^{\mathrm{A}}\big{|}o_{t}^{\mathrm{A}})\right)\right],\end{split}

(4)

where $\lambda^{\mathrm{A}}$ is a constant coefficient that determines the significance of the self-motivated intrinsic reward, $\mathbb{E}[\cdot]$ denotes the expectation, $\pi(\cdot)$ is the AI agent’s policy, $a_{t}^{\mathrm{A}}$ represents the AI agent’s action at the $t$ -th timestep, and $o_{t}^{\mathrm{A}}$ is the AI agent’s observation at the same timestep.

b) Human coordinator-motivated intrinsic reward.

To enhance coordination between the AI agent and the human, prior research ICML2019_Social_MARL_GitHub has explored the benefits of encouraging coordinators to adjust their behavior based on the AI agent’s actions. However, in human-AI coordination scenarios, a key challenge arises from the fact that human coordinator models are often pre-trained and remain untrainable during the AI agent’s training phase NeurIPS2019_Overcooked_GitHub; Stiennon_RLHF_NIPS2020; Christiano_NIPS2017_RLHF.

To address this limitation, we propose optimizing the AI agent’s policy to be more adaptable, using the known actions of the human coordinator at each timestep. Similar to the intrinsic reward mechanism that encourages exploration by discounting low-probability events in the KL divergence, the human coordinator-motivated intrinsic reward is calculated by selectively emphasizing significant state-action pairs. This approach guides the AI agent toward more effective collaboration with the human coordinator. The human coordinator-motivated intrinsic reward is given by

r_{t}^{\mathrm{H}}=\lambda^{\mathrm{H}}\cdot\mathbb{E}_{\pi}\left[\left|\log\left(\frac{\pi\left(a_{t}^{\mathrm{A}}\big{|}o_{t}^{\mathrm{A}}\right)}{{\pi}\left(\tilde{a}_{t}^{\mathrm{A}}\big{|}\tilde{o}_{t}^{\mathrm{A}}\left(a_{t}^{\mathrm{H}},o_{t}^{\mathrm{A}}\right)\right)}\right)\right|\right],

(5)

where $\lambda^{\mathrm{H}}$ is a constant representing the weight of the intrinsic reward driven by the human agent. The term $|\ \cdot\ |$ denotes the absolute value function, and $\tilde{o}_{t}^{\mathrm{A}}\left(a_{t}^{\mathrm{H}},o_{t}^{\mathrm{A}}\right)$ represents the counterfactual observation of the AI agent when only the human agent takes action $a_{t}^{\mathrm{H}}$ at the $t$ -th timestep. ${\pi}\left(\tilde{a}_{t}^{\mathrm{A}}\big{|}\tilde{o}_{t}^{\mathrm{A}}\left(a_{t}^{\mathrm{H}},o_{t}^{\mathrm{A}}\right)\right)$ corresponds to the AI agent’s behavior in this counterfactual scenario.

These intrinsic rewards encourage the AI agent to explore diverse actions and increase the likelihood of encountering the critical low-probability state-action pairs associated with sparse rewards. This strategic exploration boosts the frequency of meaningful interactions during training, helping the AI agent learn human behavior more efficiently and effectively. By intensifying the exposure of these state-action pairs, our approach ensures that the agent efficiently learns behaviors that are most beneficial for coordination.

4.2.3 Context-aware Weights

Incorporating stage rewards into the extrinsic reward can enhance the acquisition of sparse rewards by guiding exploration. However, it also introduces the risk of the AI agent becoming trapped in local optima, particularly in the early training stages when the stage reward is greater than the sparse reward. While intrinsic rewards can help mitigate this issue by encouraging exploration, excessive exploration can also be detrimental. Therefore, a desirable policy must be context-aware – able to optimize exploration and exploitation based on the current situation. Specifically, the policy should increase exploration when the AI agent is at risk of converging to suboptimal solutions and favor exploitation when the current policy is performing effectively.

To achieve this balance, we introduce context-aware weights that adaptively adjust the AI agent’s exploration and exploitation strategies based on the evolving context. The core idea is to assign greater weight to intrinsic rewards when the extrinsic reward shows minimal improvement and reduce this weight when the extrinsic reward is increasing steadily. To maintain policy robustness, these context-aware weights are updated at each epoch, in synchronization with the policy updates.

We first define the average epoch return to quantify the changes in extrinsic and intrinsic rewards by

\begin{split}\bar{R}_{n}^{\mathcal{E}}\ \ &=\frac{1}{T}\sum_{t=nT}^{(n+1)T}r_{t}^{\mathcal{E}}=\frac{1}{T}\sum_{t=nT}^{(n+1)T}\left(r_{t}^{\mathrm{E}_{\mathrm{S}}}+r_{t}^{\mathrm{E}_{\mathrm{G}}}\right),\\ \bar{R}_{n}^{\mathrm{A}}&=\frac{1}{T}\sum_{t=nT}^{(n+1)T}r_{t}^{\mathrm{A}},\\ \bar{R}_{n}^{\mathrm{H}}&=\frac{1}{T}\sum_{t=nT}^{(n+1)T}r_{t}^{\mathrm{H}}.\end{split}

(6)

In our design, we assign equal weight to both sparse and stage returns, based on the premise that stage rewards correspond to preliminary actions that ideally lead to the target sparse rewards. Therefore, the ratio between sparse and stage returns should remain consistent, reflecting an effective policy. This approach contrasts with the design of the extrinsic reward, where the stage reward component linearly fades out over time, as it is not the ultimate objective. The primary goal is to maximize the sparse reward, which drives the desired behavior.

Next, based on the changes in the epoch returns, the context-aware weights are computed using the Softmax function as

\hat{\kappa}_{n}^{\mathcal{E}},\hat{\kappa}_{n}^{\mathrm{A}},\hat{\kappa}_{n}^{\mathrm{H}}=\lambda^{\mathrm{R}}\cdot\texttt{softmax}\left(\frac{\bar{R}_{n-1}^{\mathcal{E}}}{\bar{R}_{n}^{\mathcal{E}}},\frac{\bar{R}_{n-1}^{\mathrm{A}}}{\bar{R}_{n}^{\mathrm{A}}},\frac{\bar{R}_{n-1}^{\mathrm{H}}}{\bar{R}_{n}^{\mathrm{H}}}\right),

(7)

where $\lambda^{\mathrm{R}}$ is a constant coefficient that controls the magnitude of the context-aware weights. The rationale behind this design is straightforward: if the episode return is increasing, it suggests that the current policy is effective and requires minimal updates. Conversely, if the episode return is decreasing, it signals that the policy is underperforming and should be adjusted more significantly.

To prevent excessive exploration in later stages of training, we limit the exploration by applying a threshold on the episode number $N_{th}$ . Finally, the context-aware weights are determined as follows:

\begin{split}{\kappa}_{n}^{\mathcal{E}}&=\hat{\kappa}_{n}^{\mathcal{E}}\cdot\mathds{1}\{n<N_{th}\}+\mathds{1}\{n\geq N_{th}\},\\ {\kappa}_{n}^{\mathrm{A}}&=\hat{\kappa}_{n}^{\mathrm{A}}\cdot\mathds{1}\{n<N_{th}\},\\ {\kappa}_{n}^{\mathrm{H}}&=\hat{\kappa}_{n}^{\mathrm{H}}\cdot\mathds{1}\{n<N_{th}\},\end{split}

(8)

where $\mathds{1}(\cdot)$ is the indicator function.

This context-aware weighting mechanism adjusts the influence of intrinsic and extrinsic rewards based on the effectiveness of the coordination across three domains—task performance, human behavior, and AI agent diversity. It allows the AI agent to dynamically balance exploration and exploitation, prioritizing actions that foster better coordination with the human partner and thereby improving training efficiency.

4.3 Training Algorithm

The BCD-DRL agent is adapted from PPO. A detailed explanation of PPO can be found in ppo_classic. In the following, we only present the high-level concepts and the key steps of BCD-DRL.

4.3.1 Neural Networks

It has a pair of actor and critic neural networks (NNs), where the NN parameters, including weights and biases, are denoted by $\theta$ and $\varphi$ , respectively. The actor NN generates an action $a_{t}^{\mathrm{A}}$ with stochastic policy $\pi(a_{t}^{\mathrm{A}}|o_{t}^{\mathrm{A}};\theta)$ , which is a probability density function of $a_{t}^{\mathrm{A}}$ given the current state $o_{t}^{\mathrm{A}}$ . The critic NN estimates the state-value function $V(o_{t}^{\mathrm{A}};\varphi)$ given the actor’s policy $\pi(a_{t}^{\mathrm{A}}|o_{t}^{\mathrm{A}};\theta)$ , i.e.,

V(o_{t}^{\mathrm{A}};\!\varphi)\!\approx\mathbb{E}\!\left[\sum_{k=t}^{\infty}\gamma^{t-k}r_{k}\bigg{|}\pi(a_{k}^{\mathrm{A}}|o_{k}^{\mathrm{A}};\theta),\forall k\!\geq\!t\right]\!\!,

(9)

where $\gamma$ is the discount factor.

Please briefly introduce the architecture of these two NNs.

4.3.2 Training Process

The training of the BCD-DRL agent alternates between experience generation and policy update. The step-by-step training process of BCD-DRL is given in Algorithm 1.

Algorithm 1 BCD-DRL for human-AI coordination

1:Episode number

E

, maximum steps per episode

T

, number of epochs for NN updating

K

, discount factor

\beta

, smoothing factor

\alpha

of generalized advantage estimator, mini-batch size

B

, clip factor

\omega

, the entropy loss weight factor

w

, the size of the actor output layer

C

2:Well-trained actor network

\pi^{*}(\cdot|\cdot)

3:Initialize actor network

\pi(\cdot|\cdot)

and critic network

V(\cdot)

with random parameter

\theta

and

\varphi

, respectively; Initialize the AoI of all sensors to zeros; Randomly initialize the channel states of all sensors.

4:for episode =1,…,

E

5: for t=0,…,

T

6: Generate experiences

<\mathbf{s}(t),\tilde{\mathbf{a}}(t),r(t)>

by following the current policy

\pi(\cdot|\cdot)

7: end for

8: for t=0,…,

T

9: Compute the advantage function

A(t)

in (10) and the reward-to-go

R(t)

in (11).

10: end for

11: for epoch =1,…,

K

12: Sample a random mini-batch data set of size

B

from the experiences.

13: Update the critic parameters

\varphi

by minimizing the loss function

L_{\text{C}}(\varphi)

in (12).

14: Update the actor parameters

\theta

by minimizing the actor loss function

L_{\text{A}}(\theta)

in (13).

15: end for

16:end for

a) Experience generation.

By using current policy $\pi(\cdot|\cdot;\theta_{\text{old}})$ , the BCD-DRL agent samples data $(o_{t}^{\mathrm{A}},\tilde{o}_{t}^{\mathrm{A}}\left(a_{t}^{\mathrm{H}},o_{t}^{\mathrm{A}}\right),a_{t}^{\mathrm{A}},r_{t})$ with length of $T$ through interacting with the environment during $n$ -th epoch. By leveraging the generated experience, the advantage function $A_{t}$ and the reward-to-go function $R_{t}$ for each timestep in $n$ -th epoch can be calculated as

		$\displaystyle A_{t}=\sum_{k=t}^{(n+1)T}(\gamma\alpha)^{k-t}(r_{k}+\gamma V(o_{k+1}^{\mathrm{A}};\varphi)-V(o_{k}^{\mathrm{A}};\varphi)),$		(10)
		$\displaystyle R_{t}=r_{t}+\gamma V(o_{t+1}^{\mathrm{A}};\varphi),$		(11)

where $\alpha$ is a hyper-parameter named as the smoothing factor.

b) Policy update.

The BCD-DRLPPO agent creates a mini-batch data set by randomly sampling $B$ data from the generated $T$ -length experience earlier, i.e., $\{(\mathbf{s}(t_{i}),\tilde{\mathbf{a}}(t_{i}),A(t_{i}),R(t_{i}))\}$ where $t_{i}\in\{t_{1},\dots,t_{B}\}\subset\{0,\dots,T-1\}$ . The loss function for updating the critic NN is defined as

L_{\text{C}}(\varphi)=\frac{1}{B}\sum_{i=1}^{B}\left(R(t_{i})-V\left(\mathbf{s}(t_{i});\varphi\right)\right)^{2}.

(12)

This is regarded as the temporal difference error, specifying difference of the state-value estimations based on time steps $t_{i}$ and $t_{i}+1$ .The loss function for updating the actor NN has been elaborately designed for achieving high training stability and is much more complex:

		$\displaystyle L_{\text{A}}(\theta)=$		(13)
		$\displaystyle\frac{1}{B}\!\!\sum_{i=1}^{B}\!\left(\min\!\left\{p(\mathbf{s}(t_{i});\!\theta)A(t_{i}),\!c\left(\mathbf{s}(t_{i});\!\theta\right)A(t_{i})\right\}\!+\!\delta\hbar_{\theta_{\text{old}}}\!\left(\tilde{\mathbf{a}}(t_{i})\right)\right)\!,$		(13)

where $c(\mathbf{s}(t_{i});\theta)\!=\!\max\!\left\{\min\!\left\{p(\mathbf{s}(t_{i});\theta),1\!+\!\omega\right\},1\!-\!\omega\right\}$ is a clip function with a hyper-parameter $\omega$ to set a variation constraint of the policy update, and $p(\mathbf{s}(t_{i});\theta)=\frac{\pi(\tilde{\mathbf{a}}(t_{i})|\mathbf{s}(t_{i});\theta)}{\pi\left(\tilde{\mathbf{a}}(t_{i})|\mathbf{s}(t_{i});\theta_{\text{old }}\right)}$ is a density ratio. $\hbar_{\theta_{\text{old}}}\!\!\left(\tilde{\mathbf{a}}(t_{i})\right)$ is the entropy loss function of $\tilde{\mathbf{a}}(t_{i})$ . Since $\tilde{\mathbf{a}}(t_{i})$ follows a Gaussian distribution, its entropy loss can be directly obtained based on its standard deviation $\boldsymbol{\sigma}(\mathbf{s}(t_{i});\theta_{\text{old}})$ . $\delta$ denotes the entropy loss weight factor to determine the strength of the exploration. Then, the critic NN and the actor NN parameters $\varphi$ and $\theta$ can be updated by optimizing the loss functions $L_{\text{C}}(\varphi)$ and $L_{\text{A}}(\theta)$ , respectively, using the widely adopted Adam optimizer. After training, the virtual action $\tilde{\mathbf{a}}(t)$ can be generated deterministically based on the maximum likelihood method for online deployment, i.e., $\tilde{\mathbf{a}}(t)=\boldsymbol{\mu}(\mathbf{s}(t);\theta)$ .

Hyper-parameters	Values
Episodes in each epoch	$5$
Episode length	$400$
Clip-ratio	$0.03$
Learning rate of actor network	$1\times 10^{-4}$
Learning rate of critic network	$3\times 10^{-4}$
$\lambda^{\mathrm{E}_{\mathrm{G}}}$ for $r_{t}^{\mathrm{E}_{\mathrm{G}}}$ in (2)	$4\times 10^{-6}$
$\lambda^{\mathrm{A}}$ for $r_{t}^{\mathrm{A}}$ in (4)	$0.02$
$\lambda^{\mathrm{H}}$ for $r_{t}^{\mathrm{H}}$ in (5)	$1$
$\lambda^{\mathrm{R}}$ for context-aware weights in (7)	$3$

Table 1: Key Hyper-parameters

5 Experiments

5.1 Environment

Our experiments are conducted in the Overcooked environment NeurIPS2019_Overcooked_GitHub, where a human agent and an AI agent work together to prepare as many onion soups as possible within a limited number of timesteps across a variety of layouts. The goal is to serve the soup in the designated area, earning a sparse reward of $20$ points for each successfully served soup. Achieving this requires completing a sequence of actions that correspond to stage rewards: picking up onions from a specified location, placing three onions into a pot, and serving the soup after cooking it for $20$ timesteps. In this collaborative scenario, both sparse and stage rewards are equally shared between the agents.

5.2 Agent Policies

For the human agent, we follow the methodology outlined in NeurIPS2019_Overcooked_GitHub and model the human agent’s behavior by using the behavior cloning policy. Specifically, the human behavior data is split into training and testing datasets code_github_Overcooked and the behavior cloning models used in the training and testing phases employed the respective datasets.

For the AI agent, we utilize the RL algorithm to develop its policy since we need to interact with its dynamic human coordinator and the environment. The RL algorithm’s state space corresponds to the Overcooked grid world, and its action space includes six discrete actions: move up, move down, move left, move right, stay, and interact. The “interact” action is context-sensitive, enabling the agent to pick up or place onions, plates, or soup, depending on the current state of the environment. Our implementation is based on the Gym-compatible Overcooked environment given in code_github_Overcooked.

To evaluate the performance of our proposed IReCa RL algorithm, we compare it with two baseline RL algorithms, training and testing the AI agent using each of the three policies.
PPO ${}_{\mathbf{\textit{BC}}}$ is a baseline algorithm employing the traditional proximal policy optimization (PPO) policy, where the reward consists of both the sparse reward and a linearly faded stage reward, as described in NeurIPS2019_Overcooked_GitHub.
Causal is a baseline algorithm that incorporates an intrinsic causal influence reward supplementing the extrinsic reward in PPO ${}_{\textit{BC}}$ . The causal influence reward encourages the AI agent to take actions that can lead to significant changes in its coordinator’s actions ICML2019_Social_MARL_GitHub.
IReCa is our proposed IReCa RL algorithm, which incorporates our innovative intrinsic rewards and the context-aware weights supplementing the extrinsic reward in PPO ${}_{\textit{BC}}$ .

For fair comparisons, these three RL algorithms share the same neural network architecture and hyper-parameters. The PPO structure follows the classical PPO algorithm given in ppo_classic; code_keras_ppo. The architecture of both the actor and critic networks comprise three convolutional layers (with filter sizes of $5\times 5$ , $3\times 3$ , and $3\times 3$ , each containing $25$ filters), followed by two fully connected layers with $64$ hidden units each. Key hyper-parameters are summarized in Table 1, and further details for parameter settings can be found in our GitHub repository.

Layouts of Overcooked human-AI coordination environment. Cramped Room presents low-level coordination challenges: in this shared, confined space it is very easy for the agents to collide. Asymmetric Advantages tests whether players can choose high-level strategies that play to their strengths NeurIPS2019_Overcooked_GitHub.

5.3 Results and Analysis

As shown in Fig. LABEL:fig_overcooked_layouts, we present experimental results in two distinct layouts of the Overcooked environment. The Cramped Room layout shown in Fig. LABEL:fig_layout_cramped_room poses significant low-level coordination challenges due to the confined space, leading to a high susceptibility to agent collisions. In contrast, the Asymmetric Advantages layout depicted in Fig. LABEL:fig_layout_asymmetric_advantages evaluates the ability of agents to adopt high-level strategies that leverage their individual strengths.

For each layout, the training phase consists of statistical results from $18$ independent experiments, with each experiment plotting the episode return over $400$ epochs, using a human model trained on the corresponding dataset. In the testing phase, we provide the average episode return from $400$ independent runs using the human model derived from the testing dataset.

In the experimental results, we address the following two key questions:
1) Is our IReCa RL algorithm more effective and efficient than the baseline algorithms?
2) If so, what factors contribute to the superior performance of IReCa compared to the baselines?

5.3.1 Toy Environment

Such as PacMan??? (from which perspective??? Effective Exploration???)

5.3.2 Overcooked Environment

We tested our IReCa algorithm in two different layouts of the Overcooked environment NeurIPS2019_Overcooked_GitHub.

Cramped Room Layout

Fig. LABEL:fig_cr_sim_results presents the simulation results for the Cramped Room layout of the Overcooked environment.

As shown in Fig. LABEL:fig_cr_sparse_return, the IReCa algorithm can achieve approximately $20\%$ higher average sparse episode returns compared to baseline methods during both the training and testing phases. These results indicate that IReCa is more effective than the baselines, and validate that our IReCa is more effective in identifying and exploring sparse state-action pairs that lead to targeted sparse rewards. Furthermore, the superior episode returns observed during testing experiments suggest that IReCa not only enhances exploration but also exhibits robustness and generalization capabilities, ensuring that the performance gains are due to effective exploration rather than overfitting to the training environment.

Previous studies have identified two primary factors contributing to performance degradation for AI agents in the Cramped Room layout: collisions with the human coordinator NeurIPS2019_Overcooked_GitHub and waiting for the human agent to vacate the AI agent’s preferred path AAAI_overcooked_Tsinghua. Our analysis extends this by hypothesizing that the AI agent may also become trapped in local optima by prioritizing stage returns over sparse rewards. This is because the stage episode returns are significantly higher than the sparse episode returns in the early training epochs, which may lead the AI agent to engage in unnecessary and redundant actions aimed at optimizing stage rewards.

Fig. LABEL:fig_cr_causal_analysis further investigates the role of intrinsic rewards in this process. As depicted in the figure, these intrinsic rewards are significantly higher than those in the causal baseline, helping the AI agent avoid local optima and achieve better overall performance. Since the primary objective in the Overcooked environment is task completion, independent exploration of the AI agent’s action space is crucial. The intrinsic rewards in IReCa facilitate this exploration, complementing the human agent’s actions and improving both coordination and efficiency.

Lastly, Fig. LABEL:fig_cr_context_aware_weights illustrates the evolution of context-aware weights during training. In the initial epochs, the weights assign higher values to intrinsic rewards and lower values to extrinsic rewards. This trend aligns with our expectation that comprehensive early-stage exploration will ultimately lead to better sparse rewards in human-AI coordination.

Asymmetric Advantages Layout

Fig. LABEL:fig_aa_sim_results presents the experimental results for the Asymmetric Advantages layout (see Fig. LABEL:fig_layout_asymmetric_advantages). In this layout, agents start in different areas and have access to two pots for cooking onion soup, which reduces the likelihood of collisions compared to the Cramped Room layout.

Similar to the Cramped Room results, Figs.LABEL:fig_aa_sparse_return, LABEL:fig_aa_stage_return, and LABEL:fig_aa_cau_analysis show that IReCa consistently outperforms the baseline algorithms in sparse, stage, and intrinsic episode returns, respectively. Interestingly, the gains of the episode returns are less pronounced during the training phases. This phenomenon is due to the reduced frequency of collisions and waiting times compared with the Cramped Room layout, as the agents are spatially separated and have access to more pots.

Despite this, IReCa reduces the training epochs required for convergence by approximately $67\%$ compared to the baselines. This acceleration is attributed to the context-aware weights (Fig. LABEL:fig_aa_context_aware_weights), which prioritize updates to extrinsic rewards and optimize the AI agent’s exploration of its action space. This trend of the context-aware weights is in contrast to the Cramped Room layout, where additional exploration is necessary to minimize waiting times.

The testing results in Figs. LABEL:fig_aa_sparse_return and LABEL:fig_aa_stage_return further support that in human-AI coordination scenarios employing the pre-trained human model, IReCa is more effective than traditional causal influence rewards that are valuable in purely MARL contexts.

6 Conclusion

To enhance human-AI coordination, we introduced an Intrinsic Reward-enhanced Context-aware (IReCa) reinforcement learning algorithm, which incorporated innovative intrinsic rewards to facilitate comprehensive exploration and novel context-aware weights to optimize exploration and exploitation, supplementing traditional extrinsic rewards. Extensive experimental training results in two layouts of the Overcooked environment demonstrated that our IReCa increased episode returns by approximately $20\%$ and reduced the number of epochs required for convergence by approximately $67\%$ . Testing experiments further underscored the algorithm’s robustness and generalization abilities across different human-AI coordination layouts. These findings highlighted IReCa’s potential applicability in real-world domains requiring critical human-AI coordination and environment demanding seamless human-AI interaction.

BCD Reward: Behavior and Context-Driven Reward for Deep Reinforcement Learning in Human-AI Coordination