
Optimal Behavior Prior:
Data-Efficient Human Models
for Improved Human-AI Collaboration

Mesut Yang
Amazon Search Science and AI
&Micah Carroll
UC Berkeley
&Anca Dragan
UC Berkeley
Work completed while at UC Berkeley.
Abstract

AI agents designed to collaborate with people benefit from models that enable them to anticipate human behavior. However, realistic models tend to require vast amounts of human data, which is often hard to collect. A good prior or initialization could make for more data-efficient training, but what makes for a good prior on human behavior? Our work leverages a very simple assumption: people generally act closer to optimal than to random chance. We show that using optimal behavior as a prior for human models makes these models vastly more data-efficient and able to generalize to new environments. Our intuition is that such a prior lets training focus the precious real-world data on capturing the subtle nuances of human suboptimality, rather than on the basics of how to do the task in the first place. We also show that using these improved human models often leads to better human-AI collaboration performance compared to using models based on real human data alone.

1 Introduction

One of the ultimate goals of the field of AI is creating agents that are able to collaborate with us and help us achieve our goals. Such agents benefit from predictive models of human behavior – even if those models are simply used to simulate human behavior when training an RL policy [6]. Given the nuances and complexities of human behavior – our decisions being riddled with hundreds of identified cognitive biases [4, 20] – models of human behavior need to be grounded in real data.

At the same time, basing models solely on real human data has proven challenging. Human models tend to underperform relative to real humans [24] and fail to generalize to unseen or uncommon situations [33]. The generalization problem is exacerbated in collaboration: using RL to compute an agent policy that is the best response to the human model (i.e. the policy that is optimal for the AI agent given that the human acts according to the policy captured by the model) creates incentives for the RL system to exploit biases in the human model, by exploring states during training that come from a different distribution than those induced by human-human collaboration.

Thus, the limited human data we have is very valuable, and should be used in the most efficient manner possible to learn nuances of human behavior. One way to do so might be to use low-capacity models of human behavior with high inductive bias, e.g. ones which leverage Theory of Mind [30, 8, 46, 21]. However, the quality of such models crucially depends on being able to add the correct inductive biases, and they require extensive engineering effort. In contrast, we seek a solution that retains the expressivity of high-capacity models while addressing the data requirements from above.

Figure 1: The Optimal Behavior Prior. By training an agent in self-play across many different environments, one can obtain a parameterization for optimal behavior which can be reused as an initialization (or equivalently, a prior of near-optimality) for human behavior modeling. This vastly improves human modeling data-efficiency relative to a random initialization, and can also help in training collaborative AI systems which are trained with said human models, enabling generalization to unseen environments.

Can we provide some structure for the model that increases the efficiency of our usage of human data without reducing model expressivity, i.e. its ability to improve to arbitrary levels of performance as the amount and coverage of data increase? We hypothesize that one area with high potential for improvement is providing high-capacity human models with an informed prior about what human behavior should look like: generally, human models’ prior over human behavior (corresponding to their initialization) is random. But what makes for a good prior of human behavior? Our insight is that while human behavior deviates from optimality in nuanced ways which are hard to formalize, we should leverage the fact that people are at least trying to perform tasks optimally, and use optimal behavior as a prior.

This builds off of the intuition that for most human tasks, people’s behavior is closer to being optimal than to completely random behavior. Thus, this kind of approach will be all the more effective for settings in which human behavior is closer to optimal. But even if human behavior were not very close to optimal, we expect it to be much more sample-efficient to reduce an optimal model’s performance down to human level (through human-data fine-tuning) than to increase a random model’s performance up to human level. This kind of approach has two main advantages.

Firstly, it provides the model with a good starting representation for human behavior in the task: rather than wasting precious real human data on learning to represent basic aspects of the task (the prior takes care of that already), the learner can leverage the data to capture the subtle nuances of humans’ suboptimal behavior – following a similar idea to the pre-training and fine-tuning paradigms common in domains such as computer vision and NLP [31, 14].

Additionally, although our approach requires the extra component of access to optimal behavior, it is generally easier to come by than additional human data, which would otherwise be required for good human models: self-play methods [13] have been shown to work surprisingly well in obtaining near-optimal policies even in complex domains [36, 37, 44, 28]. To encode such an Optimal Behavior Prior (which we refer to as OBP), one can use the parameterization for near-optimal agents as a starting point for training one’s human model.

We test this idea in Overcooked-AI [6], a human-AI collaboration benchmark environment based on a video game by the same name [10]. In a simulated user study, we compare how different ways of modeling humans affect human model quality. We find that using OBP for our human models leads to qualitatively more realistic human behavior and much higher competence, enabling us, for the first time, to obtain human models that generalize across different environments without being trained on specialized hard-coded state featurizations. Additionally, we investigate whether such an improved human-model training regime can also improve downstream collaborative AIs trained using our human models. Again, we find that traditional behavior cloning models are not sufficiently realistic (in our data regime) to lead to good downstream collaborative AIs: collaborative agents obtained with simple behavior cloning perform even worse than self-play agents when paired with simulated humans. However, when using human models obtained with OBP for AI training, one can achieve better collaboration performance than under all other conditions.

Overall, our results show that OBP vastly increases human modeling data efficiency, suggesting that the technique could be widely applicable in any low-human-data human-modeling setting.

2 Related work

Human modeling. There has been a recent wave of interest in obtaining good human models for games [24, 18, 1]. One of the simplest approaches to imitation learning is given by behavior cloning, which learns a policy from expert demonstrations by directly learning a mapping from observations to actions using standard supervised learning methods  [3, 42]. This kind of approach has been historically the most successful at imitating human behavior [25, 7, 38]. However, one common problem with methods like behavior cloning is that such models tend to underperform relative to real humans [24], in part likely due to insufficient expert data. Recently, [18] was able to obtain more realistic human models by leveraging search to improve the human model performance, while using a behavior cloning policy regularization to keep the policy human-like. We see our approach as at least complementary to theirs: OBP models would likely constitute better human models for regularization with their method. However, OBP human models might even be competitive in isolation when compared to search-based human models, as the former will have better performance relative to traditional behavior cloning methods, potentially obviating the need for search.

Human-AI collaboration. In this paper, we particularly focus on the benefit that improving human models can have on improving Human-AI collaboration: building agents that can collaborate with humans [6, 27]. This is related to the ad-hoc coordination problem – coordination with unknown teammates [39, 2] – but differs from it in that we assume that the teammate will be human. While it has been claimed that one might be able to obviate the need for human data to collaborate with humans [40], this will not be sufficient in settings in which capturing nuances of human suboptimal behavior is essential for the tasks at hand. The knowledge that the partner will be a human can either be encoded directly through handcrafted inductive biases [16, 46, 15, 40, 29], or by leveraging human data to varying degrees. Some approaches use both human data and inductive biases: first identifying which Nash equilibrium humans tend to play, and then biasing agents trained in self-play to learn the same equilibrium [22, 43]. However, these approaches require making strong assumptions about what test-time human behavior will look like, which may not hold in practice. This has led to various approaches to achieving human-AI collaboration which rely on learning models of human behavior and computing best responses to them [6, 17]. Our approach is closest to [6]: we first train a human model to capture human behavior, and then train a best response to it to obtain a collaborative agent. Despite requiring more human data, this enables us to make no assumptions about the form of human behavior (as it is learned from data). In this work, we try to make access to vast quantities of human data less of a bottleneck, both for the quality of human models, and of AIs meant to collaborate with humans.

Generalization to new environments. Much work has been focused on creating agents that can perform well in a variety of environments, especially for the purposes of sim-to-real transfer [48]. One common technique is domain randomization, which involves defining a distribution over environments \mathcal{E} and training the agent to perform well in any randomly drawn environment [47, 41, 19]. We use this approach for our agent training in order to have it generalize to new environments at test time.

3 Preliminaries

Multi-agent MDPs. An $n$-player MDP $\mathcal{M}$ is defined by a tuple $\big\langle\mathcal{S},\{\mathcal{A}_{i}\}_{1:n},\mathcal{P},\mathcal{T},\gamma,\mathcal{R},H\big\rangle$. $\mathcal{S}$ is a finite set of states; $\mathcal{A}_{i}$ is the finite set of actions available to agent $i$; $\mathcal{P}$ is the initial state distribution; $\mathcal{T}:\mathcal{S}\times\mathcal{A}_{1}\times\dots\times\mathcal{A}_{n}\times\mathcal{S}\to[0,1]$ is a transition function which specifies the distribution over the next state given all agents’ actions; $\gamma$ is the discount factor; $H$ is the horizon; and lastly $\mathcal{R}:\mathcal{S}\to\mathbb{R}$ is a real-valued reward function. We additionally denote the expected reward obtained by a policy tuple $\pi_{1},\dots,\pi_{n}$ by $\rho(\pi_{1},\dots,\pi_{n})=\mathbb{E}_{\pi_{1},\dots,\pi_{n}}[\sum^{H}_{t=0}r_{t}]$.

Embedding agents in multi-agent MDPs. If we are given every agent’s policy except for the $k$th, $\{\pi_{i}:i\neq k\}$, then considering the multi-agent MDP $\mathcal{M}$ from the perspective of agent $k$, the other agents can be considered as “part of the transition dynamics of the environment”. The problem of finding the optimal $\pi_{k}$ for this multi-agent MDP $\mathcal{M}$ can thus be reduced to finding the optimal $\pi$ for a single-agent MDP $\dot{\mathcal{M}}$ whose transition dynamics compose the other agents’ choices of actions $\{\pi_{i}:i\neq k\}$ with the original transition dynamics $\mathcal{T}$ [12].
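To make this embedding concrete, below is a minimal sketch of a wrapper that folds a fixed partner policy into the environment dynamics. The two-player environment interface assumed here (a `reset`/`step` pair returning per-player observations and a shared reward) is an illustrative assumption, not the actual Overcooked-AI API.

```python
class EmbeddedPartnerEnv:
    """Single-agent view of a two-player MDP with a fixed partner policy
    folded into the transition dynamics (hypothetical environment interface)."""

    def __init__(self, two_player_env, partner_policy, partner_idx=1):
        self.env = two_player_env              # assumed to expose reset()/step((a0, a1))
        self.partner_policy = partner_policy   # maps the partner's observation to an action
        self.partner_idx = partner_idx

    def reset(self):
        obs = self.env.reset()                 # assumed: one observation per player
        self._partner_obs = obs[self.partner_idx]
        return obs[1 - self.partner_idx]

    def step(self, ego_action):
        partner_action = self.partner_policy(self._partner_obs)
        joint = [None, None]
        joint[self.partner_idx] = partner_action
        joint[1 - self.partner_idx] = ego_action
        obs, reward, done, info = self.env.step(tuple(joint))  # shared reward in the collaborative setting
        self._partner_obs = obs[self.partner_idx]
        return obs[1 - self.partner_idx], reward, done, info
```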

Human-AI collaboration through deep RL. In the simplest human-AI collaboration setting, we are given a two-player MDP $\mathcal{M}$ consisting of a human and an AI agent. If the human policy $\pi_{H}$ were known, we could simply embed the policy in the environment (as described above) and use RL to obtain an optimal collaborative policy. Given that sampling from the true human policy $\pi_{H}$ is expensive, training a deep RL policy with the human fully in-the-loop is unfeasible. The approach taken by previous work [6, 29] is to leverage human data to learn a human model $\hat{\pi}_{H}$, embed it into the environment, and then use deep RL to find an optimal policy for this induced single-agent MDP $\dot{\mathcal{M}}$. If the human model is of sufficiently high quality (i.e. $\hat{\pi}_{H}$ is sufficiently close to $\pi_{H}$), the resulting agent policy will be able to collaborate with humans at test time.

Best Response (BR). In a two-player collaborative game, the best response operator $\text{BR}:\Pi\rightarrow\Pi$ takes in a fixed partner policy $\pi$ and returns an agent policy that maximizes the expected collaboration reward $\rho$ when paired with $\pi$: $\text{BR}(\pi)=\operatorname*{arg\,max}_{\tilde{\pi}\in\Pi}\mathbb{E}[\rho(\tilde{\pi},\pi)]$. Note that computing the optimal policy for the single-agent MDP $\dot{\mathcal{M}}$ above is equivalent to computing the best response to the human model $\hat{\pi}_{H}$. In environments that are sufficiently complex, computing best responses exactly is unfeasible, leading to the usage of approximate methods: in our setting, deep RL can be considered our choice of approximate best response operator $\widehat{\text{BR}}$ [12].

4 Optimal Behavior Prior (OBP) for efficient human modelling

Setup. We want to obtain a human model of the highest possible quality given the limited human data we have available. Let $\mathcal{E}$ be a distribution over the environments in which we are interested in deploying our system. We have access to human-human gameplay data for $N$ environments from the distribution $\mathcal{E}$: that is, for each environment $e_{i}\sim\mathcal{E}$ with $i\in\{1,\dots,N\}$, we have a dataset $D_{HH}^{i}$ of a human collaborating with another human. Note that when $\mathcal{E}$ deterministically returns a single environment, this setup is equivalent to the more traditional single-environment case.

Optimal-Behavior Prior (OBP). We leverage the observation that – from a Bayesian perspective – the parameter initialization of a human model can be thought of as encoding a specific prior over the model space [23, 11]. The standard approach of randomly initializing the parameters of high-capacity human models – insofar as the resulting networks tend to output high-entropy distributions over actions – is equivalent to a prior that humans behave highly randomly. Our insight is that while human behavior is generally not optimal (otherwise one could simply use optimal agents as human models), in the vast majority of tasks it is a fair assumption to expect humans to be better at the task than random behavior (i.e. humans “try to be optimal”). This leads us to expect that substituting the standard “random behavior” prior with an “optimal behavior prior” would facilitate the approximate Bayesian inference enacted by the model optimization [23], potentially requiring significantly fewer iterations (and less data) to achieve the same level of human model quality. In practice, we show that even the simplest way to encode such a prior – using the weights of a trained (near-)optimal agent as the weight initialization of our human model – works much better than a random initialization.

Computing OBP. Straightforwardly applying this idea requires the optimal behavior model (from which we intend to take the parameterization) and the human model to be of the same form. We consider neural network models, as they are plausibly expressive enough to represent both optimal and human behavior with the same model form. To obtain optimal behavior to use as a prior for our multi-environment human model, we train a policy in self-play over the distribution of environments $\mathcal{E}$, leading to a near-optimal agent over $\mathcal{E}$, $\text{SP}^{\mathcal{E}}$, whose parameterization we can take and use to warm-start the training of our human model $\text{BC}^{\mathcal{E}}_{\text{OBP}}$ (see Figure 1 and Algorithm 1).
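As a concrete illustration, here is a minimal PyTorch-style sketch of the warm-start. The function and checkpoint names are placeholders; the only real requirement is that the self-play agent and the human model share the same architecture.

```python
import torch

def init_with_obp(make_policy, selfplay_checkpoint_path):
    """Warm-start a human model from a (near-)optimal self-play policy.

    `make_policy` constructs the shared architecture used for both the self-play
    agent and the human model; `selfplay_checkpoint_path` points to the saved
    self-play weights. Both names are placeholders for this sketch."""
    human_model = make_policy()
    sp_state_dict = torch.load(selfplay_checkpoint_path, map_location="cpu")
    human_model.load_state_dict(sp_state_dict)   # the Optimal Behavior Prior as initialization
    return human_model
```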

OBP with freezing ($\text{OBP}+f$). One can also view OBP through the lens of representation learning [5]: the representations learned by optimal agents will likely be more relevant to the task than randomly initialized ones. We investigate whether the intermediate representations learned by optimal agents are sufficient to model human behavior, by freezing the first $k$ network layers from the OBP initialization and only fine-tuning the remaining network layers. From the Bayesian perspective, this can be seen as extra regularization, by encoding a Dirac prior on certain components of the model (i.e. complete certainty about the representations up until layer $k$) – this assumption of absolute certainty about representations is thus implicitly equivalent to the assumption that the remaining free parameters still contain enough expressivity to represent human behavior.
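A sketch of the freezing step is below, assuming the model's child modules are ordered from input to output (as in a simple sequential network); the indexing would need adapting for other architectures.

```python
import torch.nn as nn

def freeze_first_k_layers(model: nn.Module, k: int) -> None:
    """Freeze the first k child modules of an OBP-initialized model so that only
    the remaining layers are fine-tuned on human data (the OBP+f variant)."""
    for i, child in enumerate(model.children()):
        if i < k:
            for param in child.parameters():
                param.requires_grad_(False)  # keep the optimal agent's representations fixed
```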

Overview of assumptions. To summarize, our main assumptions are: access to a distribution of environments of interest $\mathcal{E}$, similar to the ones in which we want to collaborate with humans at test time; access to human datasets $D_{HH}^{1:N}$ from a limited number of diverse environments; being able to train (near-)optimal behavior over environments in $\mathcal{E}$; and being able to use the same model form for our human model as for the (near-)optimal behavior parameterization. Finally, we assume that human behavior at the task at hand is closer to optimality than to chance.

Having obtained a general human model, we can then extend the deep RL method described in Section 3 to train collaborative AIs across the whole of \mathcal{E} through domain randomization. See Algorithm 1 for the full algorithm for training the OBP human model and the collaborative AI agent.

Algorithm 1 Behavior Cloning with OBP, and Human-Aware Training
Input: $\mathcal{E}$, a distribution of environments; $D_{HH}^{i}$ for $i\in\{1,\dots,N\}$, human-human data for $N$ environments in $\mathcal{E}$
$\text{SP}^{\mathcal{E}} \leftarrow$ SelfPlay($\mathcal{E}$)
$\text{BC}^{\mathcal{E}}_{\text{OBP}} \leftarrow \text{SP}^{\mathcal{E}}$ ▷ Initialize the behavior cloning policy with near-optimal behavior
$\text{BC}^{\mathcal{E}}_{\text{OBP}} \leftarrow$ BehaviorCloning$\big(\text{BC}^{\mathcal{E}}_{\text{OBP}}, (D_{HH}^{i})_{i=1}^{N}\big)$ ▷ Fine-tune the initial policy on human data
$\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}}) \leftarrow$ HumanAwareRLTraining$\big(\mathcal{E},\text{BC}^{\mathcal{E}}_{\text{OBP}}\big)$ ▷ Train collaborative AI as in [6]
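The sketch below mirrors Algorithm 1 at a high level. The three training procedures are passed in as callables (stand-ins for SelfPlay, BehaviorCloning, and HumanAwareRLTraining), so none of the names here refer to an existing library.

```python
def obp_pipeline(env_distribution, human_datasets,
                 self_play_fn, behavior_cloning_fn, human_aware_rl_fn):
    """High-level sketch of Algorithm 1, with the training procedures injected as callables."""
    sp_policy = self_play_fn(env_distribution)                        # SP^E: near-optimal behavior over E
    human_model = behavior_cloning_fn(init_policy=sp_policy,          # BC^E_OBP: fine-tune the OBP
                                      datasets=human_datasets)        # on the human-human data
    collaborator = human_aware_rl_fn(env_distribution, human_model)   # approximate best response to BC^E_OBP
    return human_model, collaborator
```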

5 Experimental setup

Hypotheses. In our experiments, we hope to investigate the following two hypotheses:

H1. OBP enables human models to better generalize to unseen environments, measured by reward similarity to H and validation loss.

H2. When trained with a human model using OBP initialization, human-aware AI agents become better at collaborating with real humans on unseen environments, as measured by task reward.

To validate H1 and H2, we respectively consider a variety of possible training regimes for human models (Section 5.1), and collaborative agents (Section 5.2).

5.1 Multi-environment human models

We use high-capacity models – neural networks – to parameterize models of human behavior. Thanks to their expressivity, the same model form can also be used to parameterize (near-)optimal behavior. Below are the human model training regimes which we consider:

Vanilla BC ($\text{BC}^{\mathcal{E}}$). We take a randomly initialized neural network, and perform behavior cloning on all the collected human data $D_{HH}^{1:N}$ directly, i.e. supervised learning on the state-action pairs.

Multi-environment self-play ($\text{SP}^{\mathcal{E}}$). We train an agent on the distribution of environments $\mathcal{E}$, incentivizing the policy to achieve high collaborative reward in self-play on all environments simultaneously. This $\text{SP}^{\mathcal{E}}$ agent is later used as a proxy for optimal behavior. We include it in this section because $\text{SP}^{\mathcal{E}}$ is also a baseline for human modeling: when no data is available to obtain a human model of good quality, using optimal behavior directly as a proxy for human behavior can sometimes be sufficient [19, 28, 45, 40]. As a caveat, this approach can only be expected to work well in competitive environments, and has no guarantees in collaborative settings [6].

BC with optimal behavior prior ($\text{BC}^{\mathcal{E}}_{\text{OBP}}$). To enable the multi-environment behavior cloning agents to more efficiently leverage human data, we encode the optimal behavior prior as described in Section 4, initializing the BC network with the $\text{SP}^{\mathcal{E}}$ agent’s weight values.

BC with optimal behavior prior and freezing ($\text{BC}^{\mathcal{E}}_{\text{OBP}+f}$). We also consider an additional condition where all layers except the last of the OBP initialization are frozen, and only the last-layer weights are allowed to update using the human data. As described in Section 4, this allows us to apply extra regularization and understand whether the representations learned by the earlier layers of the parameterized OBP prior need to be changed.

5.2 Best Response (BR) agents

In order to obtain collaborative AIs, we use deep RL to find an approximate best response $\widehat{\text{BR}}$ to a human model $\hat{H}$ as described in Section 3. However, in theory the OBP method is agnostic to the approach used to compute the best response to the human model once it is obtained.

Multi-environment self-play: $\text{SP}^{\mathcal{E}}=\widehat{\text{BR}}(\text{SP}^{\mathcal{E}})$. As a baseline collaborative AI, we consider the multi-environment self-play agent from Section 5.1. By definition, such an agent is a best response to itself: $\widehat{\text{BR}}(\text{SP}^{\mathcal{E}})=\text{SP}^{\mathcal{E}}$. We will thus refer to $\widehat{\text{BR}}(\text{SP}^{\mathcal{E}})$ as $\text{SP}^{\mathcal{E}}$ – this is the agent we use in practice.

Multi-environment human-aware agents: $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}})$, $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$, and $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}+f})$. These collaborative AI agents are instead explicitly trained to complement human models based on human data, by training an approximate best response on environments from the multi-environment distribution $\mathcal{E}$ via deep RL (as described in Section 3). We do this for all human-data-based human models defined in Section 5.1, obtaining $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}})$, $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$, and $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}+f})$.
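Conceptually, this training set-up combines the embedded-partner wrapper sketched in Section 3 with domain randomization over $\mathcal{E}$. A minimal sketch, with `sample_layout` and `make_two_player_env` as hypothetical helpers, could look as follows:

```python
def make_human_aware_training_env(sample_layout, make_two_player_env, human_model):
    """Build one training environment for the approximate best response: sample a fresh
    layout from E and embed the fixed human model as the partner, reducing the problem
    to single-agent deep RL. Both helper callables are hypothetical."""
    layout = sample_layout()                                          # domain randomization over E
    return EmbeddedPartnerEnv(make_two_player_env(layout), human_model)  # wrapper sketched in Section 3
```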

5.3 Simulated user study

Framework. For all our experiments, we use the Overcooked-AI environment framework from [6], which was designed as a test-bed for human-AI collaboration performance, and has been used in various other works [26, 9, 32]. In Overcooked-AI environments, two cooks in a kitchen gridworld collaborate to cook as many soups as possible within a time limit. See Figure 2 for an example.

Figure 2: Example game-play in Overcooked-AI. To successfully deliver a dish, the two chefs need to pick up and place 3 onions in a pot, kick off the cooking process, pick up the resulting soup with a dish, and drop it off at a serving station. The goal is to maximize soup deliveries within 400 timesteps.

Why a simulated user study? We evaluate our hypothesis that OBP leads to better generalization by generating data from a “ground truth human proxy”, and using that data for training and testing. We do this because having access to the “ground truth human” enables us to run more extensive analysis than would be possible with real people: for example, we can see how good our human model is relative to how good it could possibly be (e.g. maybe the ground truth is so noisy that anything beyond a certain level of predictive accuracy is impossible), in addition to testing how much better our model is relative to other baselines. Further, it lets us analyze the importance of the optimal-behavior prior in isolation, without confounding factors encountered when interacting with real users, such as their level of adaptation to the agent.

Ground truth human proxy (H). On a high level, our ground truth human proxy model has parameters which are fit to real human data (or manually selected to induce human-like behavior). Its performance is nowhere near optimal (compare H to the $\text{SP}^{\mathcal{E}}$ reward in Figure 4), as one would expect of actual human performance. This suggests that modeling the ground truth human proxy should not benefit from OBP any more than modeling real humans would. While in Section 6 we only show results for one ground truth human, in Section C.1 we describe two other, mostly orthogonal, ways of obtaining simulated humans which in preliminary experiments gave similar results to the ones in Section 6, lending more credence to the hypothesis that this approach could also generalize to working with real humans.

Distribution of layouts. All $\sim 10^{13}$ layouts in the training environment distribution $\mathcal{E}$ are $7\times 5$ in size, but differ in the location of walls, pots, and dispensers (see Appendix A). For the evaluation layouts, we selected 5 layouts from $\mathcal{E}$ – $e^{\text{test}}_{0},e^{\text{test}}_{1},e^{\text{test}}_{2},e^{\text{test}}_{3},e^{\text{test}}_{4}$ – in the neighborhood of the 100th, 75th, 50th, 25th and 0th percentiles of maximum achievable self-play reward (see Figure 3). Selecting layouts based on achievable scores – instead of random sampling – gives us insight into the performance of the human models and best responses across the spectrum of layout difficulties. Evaluating on a small set of test environments also allows us to compare multi-environment approaches with single-environment ones (Section 6.2).
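A sketch of one way this selection could be implemented is below, assuming we have estimates of the maximum achievable self-play reward for a pool of candidate layouts; both inputs are hypothetical.

```python
import numpy as np

def pick_evaluation_layouts(layouts, max_selfplay_rewards, percentiles=(100, 75, 50, 25, 0)):
    """Pick one layout near each target percentile of maximum achievable self-play reward.
    `layouts` and `max_selfplay_rewards` are parallel sequences (hypothetical inputs)."""
    order = np.argsort(max_selfplay_rewards)          # ascending: lowest achievable reward first
    picks = []
    for p in percentiles:
        idx = order[int(round((len(order) - 1) * p / 100.0))]
        picks.append(layouts[idx])                    # p=100 -> easiest layout, p=0 -> hardest
    return picks
```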

(a) $e^{\text{test}}_{0}$  (b) $e^{\text{test}}_{1}$  (c) $e^{\text{test}}_{2}$  (d) $e^{\text{test}}_{3}$  (e) $e^{\text{test}}_{4}$
Figure 3: Our evaluation layouts. The evaluation layouts were selected to cover the entire spectrum of layout difficulties (where $e^{\text{test}}_{0}$ is easiest and $e^{\text{test}}_{4}$ is hardest). In the easier layouts (left), the two agents can perform the required tasks independently in different areas of the space without needing to interact much. In the harder layouts (right), the two agents need to coordinate with each other to get through narrow pathways without colliding.

Simulated human data. By pairing the simulated human $H$ with itself on $N=40$ layouts for 1 game per layout, we obtain a multi-layout dataset $D_{HH}^{1:N}$. This dataset is then split equally into a training set $D_{HH}^{\text{train}}$ and a test set $D_{HH}^{\text{test}}$.

Human model and $\widehat{\text{BR}}$ model. All human models and all collaborative agents share the same neural network architecture: the models take in the full Overcooked-AI game state and output a probability distribution over 6 possible actions: 4 movement directions (e.g. north), wait, and interact. Taking in a full game-state featurization (see Appendix B) is an important difference from prior work on human modeling in Overcooked-AI [6, 21], which uses a simplified handcrafted state featurization to make behavior cloning sufficiently data-efficient for any meaningful learning to occur. Given our multi-environment setting, this is a necessary choice: developing a good hardcoded featurization that would suffice for a whole distribution of layouts would be very challenging in this domain. All BC variants perform behavior cloning on $D_{HH}^{\text{train}}$ by minimizing the cross-entropy loss, and early-stop when the cross-entropy loss on $D_{HH}^{\text{test}}$ starts to increase. Additional information about the BC training conditions is located in Section C.2. For collaborative AIs, we perform $\widehat{\text{BR}}$ training with PPO [35]. In the case of multi-layout $\widehat{\text{BR}}$ agents, every on-policy PPO rollout is sampled from a different environment $e\sim\mathcal{E}$ (see Section C.2).
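A minimal sketch of the behavior-cloning loop with early stopping is shown below. The hyperparameter values, the patience-based stopping rule, and the assumption that the model outputs action logits (softmaxed downstream into the action distribution) are our own choices for illustration, not taken from the paper's training code.

```python
import copy
import torch
import torch.nn.functional as F

def behavior_cloning(model, train_loader, val_loader, lr=1e-3, max_epochs=100, patience=3):
    """Behavior cloning on (state, action) pairs with early stopping on validation
    cross-entropy. Only parameters with requires_grad=True are updated, so the same
    loop covers the frozen (OBP+f) variant."""
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    best_val, best_state, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for states, actions in train_loader:                  # state-action pairs from D_HH^train
            loss = F.cross_entropy(model(states), actions)    # model(states) assumed to be action logits
            opt.zero_grad()
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():                                 # validation cross-entropy on D_HH^test
            total = sum(F.cross_entropy(model(s), a, reduction="sum") for s, a in val_loader)
            count = sum(len(a) for _, a in val_loader)
            val = total / count
        if val < best_val:
            best_val, best_state, bad_epochs = val, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                        # stop once validation loss keeps increasing
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model
```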

6 Experimental results

6.1 Human modeling results

To test the effectiveness of our method, we varied whether imitation learning used OBP, whether we froze the representation layers ($+f$), and whether multi-layout human data was used at all (BC vs. SP). We measure the reward obtained by pairing the human model with a copy of itself on the evaluation layouts; we also measure the validation loss on our human data from the evaluation layouts, $D_{HH}^{\text{test}}$. See Figure 4.

$\text{SP}^{\mathcal{E}}$ is a bad human model under all metrics considered. As a first observation, we confirm that self-play does very poorly according to our “realism metrics” for human models: both the validation loss and the rewards obtained by $\text{SP}^{\mathcal{E}}$ agents are drastically higher than those of the ground truth human model proxy. Qualitatively, $\text{SP}^{\mathcal{E}}$ can also immediately be identified as non-human (and specifically superhuman): real humans (and the ground truth human proxy) choose “wait” actions on around 55% of timesteps (due to the game’s high clock-speed), while $\text{SP}^{\mathcal{E}}$, being a near-optimal agent, never does.

$\text{BC}^{\mathcal{E}}$ without OBP gets zero reward. Since the reward is sparse, zero reward means that the agents never complete an entire delivery (while we do qualitatively observe them performing initial tasks, they eventually get stuck). The poor human-modeling performance relative to prior work can be explained by the greater difficulty of modeling human behavior across many different layouts, plus training on the full state representation (of dimensions $26\times 7\times 5$) rather than a simplified 66-dimensional manual featurization (as explained in Section 5.3).

Figure 4: (Left) Rewards. Note that $\text{SP}^{\mathcal{E}}$ achieves super-human performance and is thus unlikely to be human-like, whereas $\text{BC}^{\mathcal{E}}$ obtains 0 reward on all evaluation layouts. (Right) Validation loss. We see that simple OBP tends to do worse than $\text{OBP}+f$ and simple $\text{BC}^{\mathcal{E}}$. For an in-depth analysis of why simple $\text{BC}^{\mathcal{E}}$ performs so well, see Appendix D. For both plots, the dotted line is the “gold standard”: the ground truth human’s reward and loss on the data H generated itself.

OBP significantly improves the reward, but not the loss. Quite surprisingly, although using OBP leads the human model to achieve significantly higher reward (much closer to the ground truth human’s reward, and thus “reward-realistic”), the cross-entropy losses are almost identical across all BC variants. In a more in-depth analysis, we found that BC without OBP arrives at low cross-entropy solutions mostly by assigning excessive probability to the “wait” action (as most human actions are “wait”). For a detailed description of this phenomenon, see Appendix D. Qualitatively, human models using OBP appear to be the most human-like, as they are actually able to complete the task, while simple $\text{BC}^{\mathcal{E}}$ is not able to complete any deliveries when paired with itself, and, as mentioned above, $\text{SP}^{\mathcal{E}}$ appears clearly superhuman.

Representation freezing. Overall, freezing seems to be somewhat beneficial: it increases rewards especially on harder layouts, and slightly lowers validation loss across the board when used with OBP, leading to validation losses that are very similar to (or marginally better than) those of simple BC (for a more in-depth analysis see Appendix D). This suggests that the extra regularization applied by freezing further improves the data-efficiency of human model training – albeit slightly – and thus that optimal-agent and human representations can be shared quite easily.

Overall, the qualitative results support H1, and so does the reward metric, while the validation loss results are inconclusive. We provide a more detailed discussion about this, and human model “quality metrics” more generally in Section 7.

6.2 Collaborative AI results

In Section 6.1 we showed that using OBP can improve human model quality – but does using such improved models also improve the collaborative AIs that best respond to them?

For deep RL training, we varied the human model to which we compute a $\widehat{\text{BR}}$ ($\text{SP}^{\mathcal{E}}$, $\text{BC}^{\mathcal{E}}$, $\text{BC}^{\mathcal{E}}_{\text{OBP}}$, and $\text{BC}^{\mathcal{E}}_{\text{OBP}+f}$). This respectively leads to the following conditions: $\text{SP}^{\mathcal{E}}$, $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}})$, $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$, and $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}+f})$. We measure collaboration reward by pairing the trained $\widehat{\text{BR}}$ agents with the ground truth proxy human $H$ on our 5 evaluation layouts. See Figure 5.

Using OBP leads to the best collaboration reward. Using OBP-based human models enables the best responses $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$ and $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}+f})$ to achieve the highest human-collaboration rewards on almost all layouts (although sometimes by slim margins); this supports the claim that when trying to obtain collaborative AI agents, it is often advantageous to use the improved human models obtained with OBP ($\text{BC}^{\mathcal{E}}_{\text{OBP}}$ and $\text{BC}^{\mathcal{E}}_{\text{OBP}+f}$) instead of assuming human optimality directly ($\text{SP}^{\mathcal{E}}$). Qualitatively, we find that $\text{SP}^{\mathcal{E}}$ agents often force their human collaborators to change their trajectories, whereas the $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$ and $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}+f})$ agents stay out of the human’s way, leading to a smoother collaboration experience.

SP does surprisingly well relative to $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}})$. While this seemingly contradicts prior work, we found it to be because $\text{BC}^{\mathcal{E}}$ was learned without a handcrafted state featurization (as mentioned in Section 5.3), leading the resulting human model $\text{BC}^{\mathcal{E}}$ to be “low quality” (at least in terms of reward). Overall, one insight from this is that unless one can obtain a sufficiently high-quality human model, assuming human optimality (i.e. using self-play models as human models, and thus also as best responses) seems to work well in practice in Overcooked-AI. This effect might be particularly marked in our distribution of environments $\mathcal{E}$, given that the main source of human suboptimality in such relatively simple tasks is speed (as mentioned before, more than half of all actions tend to be “wait”): on a strategic level, real humans and self-play agents do not seem to differ much in their gameplay (see Appendix D).

Figure 5: (Left) Multi-layout conditions. We display the test collaboration rewards for best responses (BR) to different types of human proxies, with standard errors computed over 40 rollouts. The teal horizontal line shows the average human-human collaboration reward, and the gold standard of performance is obtained by an agent trained with H directly on the evaluation layout. We can see that, in general, human-aware methods achieve higher collaboration reward when paired with H. (Right) Single-layout comparison. We see that $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$ achieves comparable or better performance than $\widehat{\text{BR}}(\text{BC}^{1}_{\text{OBP}})$ on the majority of the evaluation layouts, showing that deep RL can converge to better optima when performing domain randomization.

Our simulated user study mostly supports H2, giving hope that it would also hold with real humans.

Corresponding single-layout agents. To further investigate the effectiveness of our approach, we introduce single-layout counterparts to the multi-layout agents defined in Section 5.2: $\text{SP}^{1}$ and $\text{BC}^{1}_{\text{OBP}}$. Unlike the multi-layout agents, the single-layout agents have direct access during training to both the evaluation layout $e^{\text{test}}_{i}$ and the human data on that layout, $D_{HH}^{e^{\text{test}}_{i}}$; thus all of their human-aware training happens on $e^{\text{test}}_{i}$ with a BC partner that had access to $D_{HH}^{e^{\text{test}}_{i}}$. See Figure 5 for results.

Multi-layout agents can do better than single-layout ones. In most cases, the zero-shot performance of $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$ on the evaluation layouts is comparable to that of its single-layout counterpart $\widehat{\text{BR}}(\text{BC}^{1}_{\text{OBP}})$ (which had access to the layout for training!) – and on easier layouts it can even be significantly better. This speaks to the approximate nature of our $\widehat{\text{BR}}$ operator, deep RL. Perhaps surprisingly, multi-layout training seems to improve the quality of the optimum reached by deep RL, potentially thanks to the increased robustness and improved representations required to do well on an entire distribution of layouts rather than a single one. This suggests that making the RL problem harder through multi-environment training can also play a role in improving downstream human-AI collaboration when policies are obtained through deep RL.

7 Discussion

Metrics for evaluating human models. One takeaway from our experiments is that it’s not obvious how one should judge the quality of human models. Human models can have the exact same loss (for human data prediction) but lead to radically different qualitative behavior. Similarity in reward to humans seems like a potentially better metric, but it’s also not bulletproof: in the most extreme case, memorizing the training data would lead to the exact same reward as the demonstrations.

Limitations and future work. Firstly, while we performed our experiments with simulated ground-truth proxy humans, more work is required to verify that these results hold with real humans. In addition, a clear avenue for future work is exploring other behavior priors which might be closer to human behavior than optimal behavior, including but not limited to maximum-entropy agents. Another interesting avenue of future work regards warm-starting the training of optimal agents: in complex domains, it is common to bootstrap RL with an imitation learning policy to speed up training [28, 45]. Inspired by this, we speculate that it may be advantageous to combine these approaches: first training a simple BC model (which is very fast) to bootstrap RL training (as it helps address exploration problems), and then – assuming that the information in the human data would be lost to catastrophic forgetting – using the resulting RL policy to bootstrap the training of a better human model (via OBP).

Summary. In our work, we try to create generalizable human models for the purposes of improving human modeling and human-AI collaboration. While straightforward multi-environment human modeling struggles to learn anything meaningful, we find that good-quality multi-environment human modeling becomes possible once we leverage – through an optimal behavior prior – the knowledge that humans will be better than random. We then show that this type of modeling approach is also advantageous for downstream collaborative reward, performing better than modeling the human as optimal.

Acknowledgments and Disclosure of Funding

We’d like to thank members of the InterACT lab for helpful discussion. This work was supported by NSF CAREER, NSF NRI, ONR YIP, and the NSF Fellowship.

References

  • Abramson et al. [2021] DeepMind Interactive Agents Team Josh Abramson, Arun Ahuja, Arthur Brussee, Federico Carnevale, Mary Cassin, Felix Fischer, Petko Georgiev, Alex Goldin, Tim Harley, Felix Hill, Peter C. Humphreys, Alden Hung, Jessica Landon, Timothy P. Lillicrap, Hamza Merzic, Alistair Muldal, Adam Santoro, Guy Scully, Tamara von Glehn, Greg Wayne, Nathaniel Wong, Chen Yan, and Rui Zhu. Creating multimodal interactive agents with imitation and self-supervised learning. ArXiv, abs/2112.03763, 2021.
  • Albrecht and Ramamoorthy [2013] Stefano V. Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In AAMAS, 2013.
  • Bain and Sammut [1999] Michael Bain and Claude Sammut. A framework for behavioural cloning. In Machine Intelligence 15, Intelligent Agents [St. Catherine’s College, Oxford, July 1995], pages 103–129, Oxford, UK, UK, 1999. Oxford University. ISBN 0-19-853867-7. URL http://dl.acm.org/citation.cfm?id=647636.733043.
  • Barnes Jr [1984] James H Barnes Jr. Cognitive biases and their impact on strategic planning. Strategic Management Journal, 5(2):129–137, 1984.
  • Bengio et al. [2012] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012. URL http://arxiv.org/abs/1206.5538.
  • Carroll et al. [2019] Micah Carroll, Rohin Shah, Mark K. Ho, Tom Griffiths, Sanjit A. Seshia, Pieter Abbeel, and Anca D. Dragan. On the utility of learning about humans for human-ai coordination. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5175–5186, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/f5b1b89d98b7286673128a5fb112cb9a-Abstract.html.
  • Cazenave [2018] Tristan Cazenave. Residual networks for computer go. IEEE Transactions on Games, 10:107–110, 2018.
  • Choudhury et al. [2019] Rohan Choudhury, Gokul Swamy, Dylan Hadfield-Menell, and Anca D. Dragan. On the utility of model learning in HRI. CoRR, abs/1901.01291, 2019. URL http://arxiv.org/abs/1901.01291.
  • Fontaine et al. [2021] Matthew C. Fontaine, Ya-Chuan Hsu, Yulun Zhang, Bryon Tjanaka, and Stefanos Nikolaidis. On the Importance of Environments in Human-Robot Coordination. arXiv:2106.10853 [cs], June 2021. URL http://arxiv.org/abs/2106.10853. arXiv: 2106.10853.
  • Ghost Town Games [2016] Ghost Town Games. Overcooked, 2016. https://store.steampowered.com/app/448510/Overcooked/.
  • Grant et al. [2018] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting Gradient-Based Meta-Learning as Hierarchical Bayes. arXiv:1801.08930 [cs], January 2018. URL http://arxiv.org/abs/1801.08930. arXiv: 1801.08930.
  • Greenwald et al. [2017] Amy Greenwald, Jiacui Li, and Eric Sodomka. Solving for Best Responses and Equilibria in Extensive-Form Games with Reinforcement Learning Methods. In Can Başkent, Lawrence S. Moss, and Ramaswamy Ramanujam, editors, Rohit Parikh on Logic, Language and Society, volume 11, pages 185–226. Springer International Publishing, Cham, 2017. ISBN 978-3-319-47842-5 978-3-319-47843-2. doi: 10.1007/978-3-319-47843-2_11. URL http://link.springer.com/10.1007/978-3-319-47843-2_11. Series Title: Outstanding Contributions to Logic.
  • Hernandez et al. [2019] Daniel Hernandez, Kevin Denamganaï, Yuan Gao, Pete York, Sam Devlin, Spyridon Samothrakis, and James Alfred Walker. A generalized framework for self-play training. 2019 IEEE Conference on Games (CoG), pages 1–8, 2019.
  • Howard and Ruder [2018] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, 2018.
  • Hu and Foerster [2020] Hengyuan Hu and Jakob N. Foerster. Simplified action decoder for deep multi-agent reinforcement learning. ArXiv, abs/1912.02288, 2020.
  • Hu et al. [2020] Hengyuan Hu, Adam Lerer, Alexander Peysakhovich, and Jakob N. Foerster. "other-play" for zero-shot coordination. ArXiv, abs/2003.02979, 2020.
  • Hu et al. [2022] Hengyuan Hu, David J. Wu, Adam Lerer, Jakob Nicolaus Foerster, and Noam Brown. Human-ai coordination via human-regularized search and learning. ArXiv, abs/2210.05125, 2022.
  • Jacob et al. [2022] Athul Paul Jacob, David J. Wu, Gabriele Farina, Adam Lerer, Anton Bakhtin, Jacob Andreas, and Noam Brown. Modeling strong and human-like gameplay with kl-regularized search. ArXiv, abs/2112.07544, 2022.
  • Jaderberg et al. [2019] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. Science, 364(6443):859–865, May 2019. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.aau6249. URL http://arxiv.org/abs/1807.01281. arXiv: 1807.01281.
  • Kahneman et al. [1982] Daniel Kahneman, Stewart Paul Slovic, Paul Slovic, and Amos Tversky. Judgment under uncertainty: Heuristics and biases. Cambridge university press, 1982.
  • Knott et al. [2021] Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, Anca D. Dragan, and Rohin Shah. Evaluating the robustness of collaborative agents. CoRR, abs/2101.05507, 2021. URL https://arxiv.org/abs/2101.05507.
  • Lerer and Peysakhovich [2019] Adam Lerer and Alexander Peysakhovich. Learning existing social conventions via observationally augmented self-play. Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019.
  • Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Early Stopping is Nonparametric Variational Inference. arXiv:1504.01344 [cs, stat], April 2015. URL http://arxiv.org/abs/1504.01344. arXiv: 1504.01344.
  • McIlroy-Young et al. [2020a] Reid McIlroy-Young, Siddhartha Sen, Jon M. Kleinberg, and Ashton Anderson. Aligning superhuman ai with human behavior: Chess as a model system. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020a.
  • McIlroy-Young et al. [2020b] Reid McIlroy-Young, Russell Wang, Siddhartha Sen, Jon Kleinberg, and Ashton Anderson. Learning Personalized Models of Human Behavior in Chess. arXiv:2008.10086 [cs], August 2020b. URL http://arxiv.org/abs/2008.10086. arXiv: 2008.10086.
  • Nalepka et al. [2021] Patrick Nalepka, Jordan Gregory-Dunsmore, James Simpson, Gaurav Patil, and Michael Richardson. Interaction Flexibility in Artificial Agents Teaming with Humans. Proceedings of the Annual Meeting of the Cognitive Science Society, May 2021.
  • Nikolaidis and Shah [2013] Stefanos Nikolaidis and Julie A. Shah. Human-robot cross-training: Computational formulation, modeling and evaluation of a human team training strategy. 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 33–40, 2013.
  • OpenAI [2018] OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
  • Puig et al. [2020] Xavier Puig, Tianmin Shu, Shuang Li, Zilin Wang, Joshua B. Tenenbaum, Sanja Fidler, and Antonio Torralba. Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration. arXiv:2010.09890 [cs], October 2020. URL http://arxiv.org/abs/2010.09890. arXiv: 2010.09890.
  • Pynadath and Marsella [2005] David V Pynadath and Stacy C Marsella. Psychsim: Modeling theory of mind with decision-theoretic agents. In IJCAI, volume 5, pages 1181–1186, 2005.
  • Razavian et al. [2014] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: An astounding baseline for recognition. 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 512–519, 2014.
  • Ribeiro et al. [2022] Joao G. Ribeiro, Cassandro Martinho, Alberto Sardinha, and Francisco S. Melo. Assisting unknown teammates in unknown tasks: Ad hoc teamwork under partial observability. ArXiv, abs/2201.03538, 2022.
  • Roelofs et al. [2022] Rebecca Roelofs, Liting Sun, Benjamin Caine, Khaled S. Refaat, Benjamin Sapp, Scott M. Ettinger, and Wei Chai. Causalagents: A robustness benchmark for motion forecasting using causal relationships. ArXiv, abs/2207.03586, 2022.
  • Rubinstein [1999] Reuven Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology And Computing In Applied Probability, 1:127–190, 1999.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv:1707.06347 [cs], August 2017. URL http://arxiv.org/abs/1707.06347. arXiv: 1707.06347.
  • Silver et al. [2016] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–503, 2016. URL http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html.
  • Silver et al. [2017a] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, L. Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. ArXiv, abs/1712.01815, 2017a.
  • Silver et al. [2017b] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Hui, L. Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–359, 2017b.
  • Stone et al. [2010] Peter Stone, Gal A. Kaminka, Sarit Kraus, and Jeffrey S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In AAAI, 2010.
  • Strouse et al. [2021] DJ Strouse, Kevin R. McKee, Matthew M. Botvinick, Edward Hughes, and Richard Everett. Collaborating with humans without human data. In NeurIPS, 2021.
  • Tan et al. [2018] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. CoRR, abs/1804.10332, 2018. URL http://arxiv.org/abs/1804.10332.
  • Torabi et al. [2018] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018.
  • Tucker et al. [2020] Mycal Tucker, Yilun Zhou, and J. Shah. Adversarially guided self-play for adopting social conventions. ArXiv, abs/2001.05994, 2020.
  • [44] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojtek Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M. Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Yuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.
  • Wang et al. [2020] Rose E. Wang, Sarah A. Wu, James A. Evans, Joshua B. Tenenbaum, David C. Parkes, and Max Kleiman-Weiner. Too many cooks: Coordinating multi-agent collaboration through inverse planning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, page 2032–2034, Richland, SC, 2020. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450375184.
  • Yu et al. [2017] Wenhao Yu, C. Karen Liu, and Greg Turk. Preparing for the unknown: Learning a universal policy with online system identification. CoRR, abs/1702.02453, 2017. URL http://arxiv.org/abs/1702.02453.
  • Zhao et al. [2020] Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pages 737–744, 2020.

Appendix A Additional information about multi-Layout Overcooked-AI

The multi-layout distribution contains roughly $10^{13}$ layouts of size $7\times 5$ with approximately 75% empty space. Roughly 40% of all counters are occupied by an interactive object (onion dispenser, dish dispenser, pot, or serving station). A cook can move into any adjacent empty square that is not occupied by its collaborator, and can also interact with any object by facing the object and choosing the interact action. See Appendix Figure 6 for additional details.

Appendix B Model form details

B.1 Human models and best response agents

All human models and best response agents share the same neural network structure. The network takes a lossless state encoding with shape $26\times 7\times 5$, processes it with 3 convolutional layers (25 kernels per layer, of size $5\times 5$, $3\times 3$, and $3\times 3$ respectively) followed by 3 fully connected layers (of hidden size 64), and outputs a probability distribution over the legal actions: north, south, east, west, wait, and interact (a 6-dimensional vector). While it might improve performance in practice, we don’t use recurrence over the state history, to reduce the computational burden of training.
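A sketch of this architecture in PyTorch is below. The padding choice, the activation function, and whether the 6-way output head counts as one of the three fully connected layers are assumptions not specified in the text.

```python
import torch.nn as nn

class OvercookedPolicyNet(nn.Module):
    """Sketch of the shared policy architecture described above (details such as
    padding and activations are assumptions)."""

    def __init__(self, in_channels=26, height=7, width=5, n_actions=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 25, kernel_size=5, padding="same"), nn.LeakyReLU(),
            nn.Conv2d(25, 25, kernel_size=3, padding="same"), nn.LeakyReLU(),
            nn.Conv2d(25, 25, kernel_size=3, padding="same"), nn.LeakyReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(25 * height * width, 64), nn.LeakyReLU(),
            nn.Linear(64, 64), nn.LeakyReLU(),
            nn.Linear(64, n_actions),   # logits over {north, south, east, west, wait, interact}
        )

    def forward(self, x):
        # x: (batch, 26, 7, 5) lossless state encoding; a softmax over the output
        # gives the probability distribution over the 6 actions.
        return self.fc(self.conv(x))
```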

B.2 Simulated humans

We run our experiments with simulated humans. At every timestep, such models re-plan an action sequence to complete the next high-level task (selected greedily), and take the first action in that plan. They have three tunable parameters: 1) probability of waiting ($prob\_wait$): the probability of not moving (taking the "wait" action) at each timestep; 2) low-level Boltzmann irrationality ($ll_{temp}$), which injects noise into the low-level action selection process while carrying out the chosen goal, with noise proportional to the suboptimality of each action (i.e. causing the model to sometimes pick a "left" or "wait" action instead of "right"); and 3) high-level Boltzmann irrationality ($hl_{temp}$), which injects noise into the goal-picking process, once again with noise proportional to the goal cost (i.e. leading the agent to sometimes get onions from a more distant onion dispenser).
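The sketch below illustrates how these three parameters could shape a single action choice. The `goal_costs` and `action_costs` inputs are hypothetical outputs of the planner (not part of any released API), and a temperature of 0 recovers the greedy, noise-free choice.

```python
import numpy as np

def simulated_human_action(goal_costs, action_costs, hl_temp, ll_temp, prob_wait,
                           wait_action=4, rng=np.random):
    """Pick one action for the simulated human: wait with probability prob_wait, otherwise
    choose a goal via a Boltzmann distribution over goal costs (hl_temp), then a low-level
    action via a Boltzmann distribution over that goal's action costs (ll_temp)."""
    if rng.random() < prob_wait:                    # 1) idle with probability prob_wait
        return wait_action

    def boltzmann(costs, temp):
        if temp <= 0:                               # temp -> 0 recovers the greedy (noise-free) choice
            return int(np.argmin(costs))
        weights = np.exp(-np.asarray(costs, dtype=float) / temp)
        return int(rng.choice(len(costs), p=weights / weights.sum()))

    goal = boltzmann(goal_costs, hl_temp)           # 2) noisy high-level goal selection
    return boltzmann(action_costs[goal], ll_temp)   # 3) noisy low-level action selection
```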

In preliminary investigations, we tested different types of simulated humans, which differed in what aspect of the human data they were fit to: 1) The waiting-matching $\text{H}_0$ only had its probability of waiting (a free parameter) fit to humans' proportion of waiting time; the low- and high-level noise are instead set to 0. 2) The action-matching $\text{H}_1$ was additionally fit to capture real humans' suboptimality at the motion and goal-selection level (i.e. minimizing cross-entropy loss with respect to real human actions in the data). 3) The reward-matching $\text{H}_2$ mainly focuses on achieving human-level reward, only minimizing the difference between the reward obtained by real humans and that obtained by the model. In preliminary experiments with all 3 models, we found our method to perform similarly relative to baselines. Throughout the main text, we let $H = \text{H}_0$.

Figure 6: Samples from the distribution of training environments $\mathcal{E}$. Light brown squares are empty spaces that the cooks can occupy. Dark brown squares are counters, which can be occupied by an onion dispenser (yellow shapes), a dish dispenser (white circles), a pot (a dark grey rectangle with a black top), or a serving station (a light grey square).

Appendix C Training details

C.1 Simulated humans

As described above, in an attempt to make the simulated human models as human-like as possible, we fit their parameters to human data by running the cross-entropy method (CEM) [34] on the human data collected in [21]. Appendix Table 1 shows the fitted parameters.
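A minimal sketch of this fitting procedure is given below; the loss function, initialization, and CEM hyperparameters are illustrative rather than the values we used.

```python
import numpy as np

def cem_fit(loss_fn, mean, std, iters=20, pop=50, elite_frac=0.2, seed=0):
    """Generic cross-entropy method: sample candidate parameter vectors from a
    Gaussian, keep the lowest-loss elite fraction, and refit the Gaussian."""
    rng = np.random.default_rng(seed)
    mean, std = np.asarray(mean, float), np.asarray(std, float)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        candidates = rng.normal(mean, std, size=(pop, mean.size))
        losses = np.array([loss_fn(c) for c in candidates])
        elite = candidates[np.argsort(losses)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Hypothetical usage: fit (hl_temp, ll_temp, prob_wait) so as to minimize
# cross-entropy against the logged human actions.
# best = cem_fit(lambda p: human_action_cross_entropy(*p),
#                mean=[0.1, 0.3, 0.4], std=[0.1, 0.1, 0.2])
```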


H name | $hl_{temp}$ | $ll_{temp}$ | $prob\_wait$ | Average cross-entropy loss | Average reward fraction
$\text{H}_0$ (H in main results) | 0 (not fit) | 0 (not fit) | 0.5 | 1.62 | 0.631
$\text{H}_1$ | 0.072 | 0.286 | 0.45 | 1.81 | 0.318
$\text{H}_2$ | 0.070 | 0.249 | 0.04 | 3.19 | 0.805
Table 1: Three types of simulated humans. $\text{H}_0$ achieves the lowest cross-entropy loss while obtaining a decent level of reward matching. $\text{H}_1$ achieves similar cross-entropy loss while achieving lower reward. $\text{H}_2$ achieves the most human-like reward at the expense of a much higher cross-entropy loss.

C.2 Human models and best response agents

Self-play training. Multi-layout self-play is performed with a batch size of 200k: each batch is composed of 500 episodes of length 400, collected on 250 different layouts sampled from $\mathcal{E}$. We train with PPO for 3000 epochs (8 SGD gradient steps per epoch, with mini-batches of 160k). Single-layout self-play training also uses a batch size of 200k, but fewer training epochs are required: we use 500. Appendix Table 2 provides a summary.

SP method | Training distribution | Number of layouts seen | Batch size | Epochs | Entropy coefficient schedule | Entropy horizon
$\text{SP}^{\mathcal{E}}$ | $\mathcal{E}$ | $1.5\times 10^6$ | 200k | 3000 | 0.2 → 0.0005 | 2000 ep
$\text{SP}^1$ | $e^{\text{test}}_i$ | 1 | 200k | 500 | 0.2 → 0.005 | 150 ep
Table 2: Training conditions for SP. We find the entropy schedule to be the hyperparameter that makes the biggest difference between the multi-layout and single-layout settings, while all other PPO hyperparameters are kept the same. $\text{SP}^{\mathcal{E}}$ achieves the same level of reward as $\text{SP}^1$ on all evaluation layouts.
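The entropy coefficient schedules in Table 2 can be read as annealing from the first value to the second over the stated entropy horizon. A minimal sketch, assuming linear annealing (the exact schedule shape is not specified here):

```python
def entropy_coeff(epoch, start, end, horizon):
    """Anneal the PPO entropy coefficient from `start` to `end` over
    `horizon` epochs and hold it constant afterwards."""
    frac = min(epoch / horizon, 1.0)
    return start + frac * (end - start)

# e.g. multi-layout self-play:  entropy_coeff(epoch, 0.2, 0.0005, 2000)
#      single-layout self-play: entropy_coeff(epoch, 0.2, 0.005, 150)
```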

Simulated data. We use our "ground truth" human proxy to create a dataset as if we had run an actual user study with 20 pairs of participants. Each pair of participants collaborates on 7 layouts for one episode each, where 5 layouts are the test ones and the remaining two are sampled from $\mathcal{E}$. The intention is to be able to train human models on data from $\mathcal{E}$, while also having ground-truth data from the evaluation layouts.

We simulate performing an entire user study as follows: we pair $H$ with itself on 7 layouts ($e^{\text{test}}_0$, $e^{\text{test}}_1$, $e^{\text{test}}_2$, $e^{\text{test}}_3$, $e^{\text{test}}_4$, and two sampled layouts $e^{\text{sampled}} \sim \mathcal{E}$), and for each we collect an episode of length 400. We then repeat this procedure 20 times with the same pair of $H$ agents (but sampling different $e^{\text{sampled}}$ layouts from $\mathcal{E}$ every time).

We then aggregate the data on the $N=40$ sampled layouts $e^{\text{sampled}}_{0:39}$ (the probability of resampling the same layout is negligible) to form the multi-layout human dataset $D^{1:N}_{HH}$, corresponding to the notation introduced in Sec. 4 of the main text.

Behavior cloning. Multi-layout behavior cloning is performed on the human-human play described above, which amounts to 32k game timesteps across the distribution of layouts. The single-layout BC counterparts in our experiment use 16k timesteps. For reference, previous work [21] used roughly 24k timesteps to train human models on a single layout. Appendix Table 3 provides a summary.

BC method | Behavioral prior | Freezing | Human data | Final training loss | Validation loss
$\text{BC}^{\mathcal{E}}$ | None (random init) | No | 32k on $\mathcal{E}$ | 1.34 | 1.60 (avg for $e^{\text{test}}_{1:5}$)
$\text{BC}^{\mathcal{E}}_{\text{OBP}}$ | OBP ($\text{SP}^{\mathcal{E}}$) | No | 32k on $\mathcal{E}$ | 1.09 | 1.70 (avg for $e^{\text{test}}_{1:5}$)
$\text{BC}^{\mathcal{E}}_{\text{OBP}+f}$ | OBP ($\text{SP}^{\mathcal{E}}$) | Yes | 32k on $\mathcal{E}$ | 1.23 | 1.61 (avg for $e^{\text{test}}_{1:5}$)
$\text{BC}^1$ | None (random init) | No | 16k on $e^{\text{test}}_i$ | 1.02 (avg for $e^{\text{test}}_{1:5}$) | 1.10 (avg for $e^{\text{test}}_{1:5}$)
$\text{BC}^1_{\text{OBP}}$ | OBP ($\text{SP}^1$) | No | 16k on $e^{\text{test}}_i$ | 0.99 (avg for $e^{\text{test}}_{1:5}$) | 1.12 (avg for $e^{\text{test}}_{1:5}$)
Table 3: Training conditions for all BC methods, split into the multi-layout and single-layout groups. OBP allows the agent to achieve lower training loss. Single-layout variants are capable of achieving lower loss, likely thanks to memorization by an over-parameterized BC network.
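To make the distinctions between the BC variants in Table 3 concrete, the sketch below shows how their initializations differ; the helper and its arguments are ours, and it assumes a model with a .conv trunk as in the sketch of Appendix B.1.

```python
import copy

def init_bc_model(sp_model, fresh_model_fn, use_obp=True, freeze_conv=False):
    """How the BC variants differ at initialization time (a sketch):
      BC^E / BC^1:          random init                (use_obp=False)
      BC^E_OBP / BC^1_OBP:  copy the SP weights        (use_obp=True)
      BC^E_OBP+f:           also freeze the conv trunk (freeze_conv=True)
    """
    if not use_obp:
        return fresh_model_fn()                 # no prior: random initialization
    model = copy.deepcopy(sp_model)             # Optimal Behavior Prior
    if freeze_conv:
        for p in model.conv.parameters():       # assumes a .conv trunk as in B.1
            p.requires_grad = False
    return model
```

Training then proceeds as standard behavior cloning, minimizing cross-entropy between the model's action logits and the human actions in the dataset.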

Best response training. Rollouts of partnered play (with the partner fixed to the BC agent) are collected as episodes of length 400, simultaneously on 500 freshly sampled layouts per iteration, forming a batch size of 200k, for 500 iterations. Appendix Table 4 provides a summary.

BR method | Behavioral prior | Training distribution | Epochs | Entropy coefficient schedule | Entropy horizon
$\widehat{\text{BR}}(\text{BC}^{\mathcal{E}})$, $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}})$, $\widehat{\text{BR}}(\text{BC}^{\mathcal{E}}_{\text{OBP}+f})$ | OBP ($\text{SP}^{\mathcal{E}}$) | $\mathcal{E}$ | 500 | 0.1 → 0.0005 | 250 ep
$\widehat{\text{BR}}(\text{BC}^1)$, $\widehat{\text{BR}}(\text{BC}^1_{\text{OBP}})$ | OBP ($\text{SP}^1$) | $e^{\text{test}}_i$ | 500 | 0.2 → 0.005 | 250 ep
$\widehat{\text{BR}}(H)$ | OBP ($\text{SP}^1$) | $e^{\text{test}}_i$ | 500 | 0.2 → 0.005 | 250 ep
Table 4: All best response conditions. Note that we also provide OBP to all best response agents, and give each group of conditions equal access to training.
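For clarity, one timestep of rollout collection during best-response training can be sketched as follows: the learner samples from its current policy while its BC partner stays frozen (names and call signatures are ours, not the exact implementation).

```python
import torch

@torch.no_grad()
def partnered_step(br_policy, bc_human, obs_br, obs_h):
    """One joint action during best-response rollout collection: the learner
    (trained with PPO) acts alongside a fixed BC human model; both policies
    are assumed to output action logits as in the sketch of Appendix B.1."""
    a_br = torch.distributions.Categorical(logits=br_policy(obs_br)).sample()
    a_h = torch.distributions.Categorical(logits=bc_human(obs_h)).sample()
    return a_br, a_h
```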

Appendix D Additional analysis for human modeling conditions

In the human data used for our simulated humans’ parameter fitting, roughly 40%-65% of the actions were “wait” actions. This is likely due to the rapid clock speed of the game, making it difficult for humans to perform constructive actions at every timestep.

Looking at Figure 4 (right) in the main text, we were surprised that OBP methods performed only as well as non-OBP methods in terms of validation loss. To further investigate the cause of this similar performance, we broke down the training datapoints into two groups: those in which the ground-truth action was a "wait" action, and those in which it wasn't. We show this breakdown in Figure 7.
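A minimal sketch of this breakdown, assuming the network's output ordering matches the action list in Appendix B.1 (so that "wait" is index 4):

```python
import torch.nn.functional as F

WAIT_IDX = 4  # "wait" in the (north, south, east, west, wait, interact) ordering

def loss_by_wait(logits, actions):
    """Cross-entropy split by whether the ground-truth human action was 'wait',
    mirroring the breakdown shown in Figure 7."""
    per_example = F.cross_entropy(logits, actions, reduction="none")
    is_wait = actions == WAIT_IDX
    return per_example[is_wait].mean(), per_example[~is_wait].mean()
```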

We find that $\text{BC}^{\mathcal{E}}$ assigns high probability to wait actions (thus achieving even lower loss on them than the ground-truth human $H$ does). This supports our intuition that, without a prior, $\text{BC}^{\mathcal{E}}$ finds the easiest route to lower loss: over-emphasizing "wait" predictions. By using OBP, $\text{BC}^{\mathcal{E}}_{\text{OBP}}$ and $\text{BC}^{\mathcal{E}}_{\text{OBP}+f}$ instead achieve more human-like loss on wait actions. We speculate that this is what makes $\text{BC}^{\mathcal{E}}$ a qualitatively worse human model that is incapable of achieving reward, as mentioned in the main text.

Figure 7: Breaking down the loss: even though $\text{BC}^{\mathcal{E}}$, $\text{BC}^{\mathcal{E}}_{\text{OBP}}$, and $\text{BC}^{\mathcal{E}}_{\text{OBP}+f}$ achieve similar loss when averaged across all actions, OBP helps prevent human models from overpredicting the probability of waiting, as $\text{BC}^{\mathcal{E}}$ does.