
Measuring Acoustics with Collaborative Multiple Agents

The full paper with appendix, together with the source code, can be found at https://yyf17.github.io/MACMA.

Yinfeng Yu1,6, Changan Chen2, Lele Cao1,3, Fangkai Yang4, Fuchun Sun1,5
1Beijing National Research Center for Information Science and Technology, State Key Lab on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University
2University of Texas at Austin
3Motherbrain, EQT Group
4Microsoft Research
5THU-Bosch JCML Center
6College of Information Science and Engineering, Xinjiang University yyf17@mails.tsinghua.edu.cn,    changan@cs.utexas.edu,    lele.cao@eqtpartners.com, fangkai.yang@microsoft.com,    fcsun@mail.tsinghua.edu.cn
Corresponding author: Fuchun Sun.
Abstract

As humans, we hear sound every second of our life. The sound we hear is often affected by the acoustics of the environment surrounding us. For example, a spacious hall leads to more reverberation. Room Impulse Responses (RIR) are commonly used to characterize environment acoustics as a function of the scene geometry, materials, and source/receiver locations. Traditionally, RIRs are measured by setting up a loudspeaker and microphone in the environment for all source/receiver locations, which is time-consuming and inefficient. We propose to let two robots measure the environment’s acoustics by actively moving and emitting/receiving sweep signals. We also devise a collaborative multi-agent policy where these two robots are trained to explore the environment’s acoustics while being rewarded for wide exploration and accurate prediction. We show that the robots learn to collaborate and move to explore environment acoustics while minimizing the prediction error. To the best of our knowledge, we present the very first problem formulation and solution to the task of collaborative environment acoustics measurements with multiple agents.

1 Introduction

Figure 1: Learning to measure environment acoustics with two collaborative robots. The background color indicates sound intensity ("High", "Middle" and "Low" areas). Each time step (one per second) comprises three sub-steps: 1) robot 0 emits a sound and robot 1 receives it; 2) robot 1 emits the sound and robot 0 receives it; 3) the two robots move following their learned policies. This process repeats until the maximum number of time steps is reached.

Sound is critical for humans to perceive and interact with the environment. Before reaching our ears, sound travels via different physical transformations in space, such as reflection, transmission and diffraction. These transformations are characterized and measured by a Room Impulse Response (RIR) function Välimäki et al. (2016). The RIR is the transfer function between the sound source and the listener (microphone); convolving an anechoic sound with the RIR yields the corresponding reverberant sound Cao et al. (2016). The RIR is utilized in many applications such as sound rendering Schissler et al. (2014), sound source localization Tang et al. (2020), audio-visual matching Chen et al. (2022), and audio-visual navigation Chen et al. (2020, 2021b, 2021a); Yu et al. (2022b). For example, to achieve clear speech in a concert hall, one might call for a sound rendering that increases acoustic reverberation where needed while keeping the auditorium less reverberant Mildenhall et al. (2022). The key is to measure the RIR at different locations in the hall. However, RIR measurement is time-consuming due to the large number of samples to traverse. To illustrate, in a 5×5 m² room with a spatial resolution of 0.5 m, the number of measurable points is 11×11 = 121. The (omnidirectional) source can be placed at any of these 121 points. Assuming a listener with four orientations (0°, 90°, 180°, 270°), the listener can likewise choose from 121 points with four directions at each point. The number of source-listener pairs therefore becomes 121×121×4 = 58,564. Assuming the binaural RIR is sampled at 16 kHz, lasts 1 second and is stored as float32, one RIR sample requires 2×16,000×4 bytes = 128 KB of storage. The entire room would take up to 58,564×128 KB ≈ 7.5 GB. Moreover, one has to move the source/listener devices 58,564 times and perform data sending/receiving for each pair.
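As a quick sanity check on these figures, the short Python sketch below reproduces the pair count and storage estimate using exactly the sampling assumptions stated above (grid resolution, orientations, sampling rate, duration, and precision).

```python
# Back-of-the-envelope check of the measurement cost quoted above.
# Assumptions (all from the text): 5 x 5 m room, 0.5 m grid, 4 listener
# orientations, binaural (2-channel) 1-second RIRs at 16 kHz in float32.
grid_points = 11 * 11                        # 121 candidate positions
pairs = grid_points * grid_points * 4        # 58,564 source-listener pairs
bytes_per_rir = 2 * 16000 * 4                # 128 KB per binaural RIR
total_gib = pairs * bytes_per_rir / 1024**3  # ~7.1 GiB, i.e. on the order of the 7.5 GB above
print(pairs, bytes_per_rir // 1024, round(total_gib, 1))
```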

There have been some attempts to address the storage challenge: FAST-RIR Ratnarajah et al. (2022b) relies on handcrafted features and the source (emitter) and listener (receiver) locations to generate the RIR, while being agnostic to fidelity; MESH2IR Ratnarajah et al. (2022a) uses the scene's mesh and source/listener locations to generate the RIR while ignoring the measurement cost; Neural Acoustic Field (NAF) Luo et al. (2022) tries to learn the parameters of the acoustic field, but its training time and model storage cost grow linearly with the number of environments Majumder et al. (2022). Some work Singh et al. (2021) suggests that storing the original RIR data of the sampled points is unnecessary, and that only the acoustic field parameters must be stored. However, given a limited number of action steps, it is challenging to model the acoustic field.

To overcome the aforementioned challenges, we propose MACMA (Measuring Acoustics with Collaborative Multiple Agents), illustrated in Figure 1. The two agents, one source (emitter) and one listener (receiver), learn a motion policy to perform significance sampling of the RIR within any given 3D scene. The trained agents can then move (according to the learned motion policy) in any new 3D scene to predict the RIR of that scene. To achieve this, we design two policy learning modules: the RIR prediction module and the dynamic allocation module of the environment reward. In Appx. B, we explore the design of the environmental reward and, based on it, further propose a reward distribution module that learns how to efficiently distribute the reward obtained at the current step, thereby incentivizing the two agents to learn to cooperate and move. To facilitate the convergence of the optimization, we design loss functions separately for the policy learning module, the RIR prediction module, and the reward allocation module. Comparative and ablation experiments are performed on two datasets, Replica Straub et al. (2019) and Matterport3D Chang et al. (2017), verifying the effectiveness of the proposed solution. To the best of our knowledge, this work is the first RIR measurement method using two collaborative agents. The main contributions of this work are:

  • we propose a new setting for planning RIR measurement under finite time steps, and a solution that measures the RIR through two-agent cooperation in low-resource situations;

  • we design a novel reward function for the multi-agent decomposition to encourage coverage of environment acoustics;

  • we design evaluation metrics for the collaborative measurement of RIR, and we experimentally verify the effectiveness of our model.

Figure 2: The MACMA architecture: agent 0 and agent 1 first learn to encode observations as $s^{\omega}_{t}$ and $s^{\nu}_{t}$ using encoders $E_{\omega}$ and $E_{\nu}$, respectively, which are fed to actor-critic networks to predict the next actions $a_{t}^{\omega}$ and $a_{t}^{\nu}$. The RIR Measurement module learns to predict the room impulse response $\hat{W}_{t}$, guided by the ground truth $W_{t}$.

2 Related Work

RIR generation. Measuring the RIR has been of long-standing interest to researchers Cao et al. (2016); Savioja and Svensson (2015). Traditional methods for generating RIR include statistics-based methods Schissler et al. (2014); Tang et al. (2022) and physics-based methods Mehra et al. (2013); Taylor et al. (2012). However, they are computationally prohibitive. Recent methods estimate acoustic parameters to generate the RIR indirectly Masztalski et al. (2020); Diaz-Guerra et al. (2021); Ratnarajah et al. (2021). Although these methods are flexible in extracting different acoustic cues, their predictions are independent of the exact locations of the source and receiver, making them unsuitable for scenarios where the mapping between the RIR and the source/receiver locations is important (e.g., sound source localization and audio-visual navigation). FAST-RIR Ratnarajah et al. (2022b) is a GAN-based RIR generator that produces a large-scale RIR dataset and accurately models source and receiver locations, but for efficiency it relies on handcrafted features instead of learning them, which affects generative fidelity Luo et al. (2022). Neural Acoustic Field (NAF) Luo et al. (2022) addresses the issues of efficiency and fidelity by learning an implicit representation of the RIR Mildenhall et al. (2022) and by introducing global features and local embeddings. However, NAF cannot generalize to new environments, and its training time and model storage cost grow linearly with the number of environments Majumder et al. (2022). The recently proposed MESH2IR Ratnarajah et al. (2022a) is an indoor 3D-scene IR generator that takes the scene's mesh, listener positions, and source locations as input. MESH2IR Ratnarajah et al. (2022a) and FAST-RIR Ratnarajah et al. (2022b) assume that the environment and reverberation characteristics are given; hence they only consider fitting the existing dataset and ignore the measurement cost. In contrast, our model considers how two moving agents collaborate to optimize RIR measurement in terms of time consumption, coverage, accuracy, etc. It is worth mentioning that our model addresses multiple scenes, which makes it more suitable for generalizing to unseen environments.

Audio spatialization. Binaural audio generation methods include converting mono audio to binaural audio using visual information in video Garg et al. (2021), utilizing spherical harmonics to generate binaural audio from mono audio for training Xu et al. (2021), and generating binaural audio from video Ruohan and Kristen (2019). Using 360-degree videos from YouTube to generate 360-degree ambisonic sound Morgado et al. (2018) is a higher-dimensional form of audio spatialization. Alternatively, Rachavarapu et al. (2021) directly synthesize spatial audio. Audio spatialization has a wide range of practical applications, such as object/speaker localization Jiang et al. (2022), speech enhancement Michelsanti et al. (2021), and speech recognition Shao et al. (2022). Although all the above works use deep learning, our work is fundamentally different in that we model binaural RIR measurement as a decision process (of two moving agents that learn to plan the measurement) using time-series states as input.

Audio-visual learning. Majumder et al. (2022) harness the synergy of egocentric visual and echo responses to infer ambient acoustics and predict the RIR. Advances in audio-visual learning have benefited many tasks, such as audio-visual matching Chen et al. (2022), audio-visual source separation Majumder and Grauman (2022) and audio-visual navigation Chen et al. (2020); Dean et al. (2020); Gan et al. (2020, 2022); Yu et al. (2022a, 2023). There is also work that uses echo responses together with vision to learn better spatial representations Gao et al. (2020), infer depth Christensen et al. (2020), or predict floor plans of 3D environments Purushwalkam et al. (2021). The closest work to ours is audio-visual navigation; however, audio-visual navigation has explicit navigation goals, whereas the agents in our setup have no clear navigation destinations.

Multi-agent learning. There are two main families of collaborative multi-agent methods: value-decomposition-based methods Rashid et al. (2018, 2020) and actor-critic-based methods Foerster et al. (2018); Lowe et al. (2017). The centralized training with decentralized execution (CTDE) paradigm Wang et al. (2021) has recently attracted attention for its ability to address non-stationarity while maintaining decentralized execution. Learning a centralized critic with decentralized actors (CCDA) is an efficient approach that exploits the CTDE paradigm; multi-agent deep deterministic policy gradient (MADDPG) and counterfactual multi-agent (COMA) are two representative examples. In our design, we have a centralized critic (named Critic) with decentralized actors (named AgentActor0 and AgentActor1). However, our task differs from all the above multi-agent learning methods in that the agents in our scenario work on a collaborative measurement task, while previous multi-agent learning research has mainly focused on competitive tasks.

3 The Proposed Approach

Our model MACMA has two collaborative agents moving in a 3D environment (Figure 1), using vision, position, and azimuth to measure the RIR. The proposed model consists of three main parts: agent 0, agent 1, and RIR measurement (see Figure 2). Given egocentric vision, azimuth, and position inputs, our model encodes these multi-modal cues to 1) determine the actions of the agents and evaluate the actions taken for policy optimization, 2) measure the room impulse response and evaluate the regression accuracy of the RIR generator, and 3) evaluate the trade-off between the agents' exploration and the RIR measurement. The two agents repeat this process until the maximum number of steps is reached.

Specifically, at each step $t$ (cf. Figure 2), the robots receive their own current observations $O^{\omega}_{t}$ and $O^{\nu}_{t}$, where $O^{\omega}_{t}=(I^{\omega}_{t},\Phi^{\omega}_{t},P^{\omega}_{t})$ and $O^{\nu}_{t}=(I^{\nu}_{t},\Phi^{\nu}_{t},P^{\nu}_{t})$. Here $I^{\omega}_{t}=(I^{\omega,rgb}_{t},I^{\omega,depth}_{t})$ and $I^{\nu}_{t}=(I^{\nu,rgb}_{t},I^{\nu,depth}_{t})$ are the egocentric visual inputs of robot 0 and robot 1, respectively; $\Phi^{\omega}_{t}=(\varphi^{\omega}_{t},t)$ and $\Phi^{\nu}_{t}=(\varphi^{\nu}_{t},t)$ are their azimuths; and $P^{\omega}_{t}=(x_{\omega},y_{\omega},z_{\omega})$ and $P^{\nu}_{t}=(x_{\nu},y_{\nu},z_{\nu})$ are their positions. $I^{\omega}_{t}$ and $I^{\nu}_{t}$ denote the current visual input, which can be an RGB (128×128×3 pixels) and/or depth (128×128×1) image; both RGB and depth images capture the 90-degree field of view in front of the navigating robot. $\Phi^{\omega}_{t}$ and $\Phi^{\nu}_{t}$ are 2D vectors including time, and $P^{\omega}_{t}$ and $P^{\nu}_{t}$ are 3D vectors. Although there exists a navigability graph (with nodes and edges) of the environment, this graph is hidden from the robots; hence they must learn from the accumulated observations $O^{\omega}_{t}$ and $O^{\nu}_{t}$ to understand the geometry of the scene. At each step, a robot at node A can only move to another node B in the navigability graph if 1) an edge connects both nodes, and 2) the robot is facing node B. The robotic action space is defined as $\mathcal{A}$ = {MoveForward, TurnLeft, TurnRight, Stop}, where the Stop action is executed when the robot completes the task or the number of its actions reaches the maximum number of steps. The overall goal is to predict the RIR of a new scene accurately while exploring widely.
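For concreteness, a minimal sketch of the per-robot observation tuple and the action space described above is given below; the container classes are purely illustrative and not the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum
import numpy as np

class Action(Enum):
    """Discrete action space A shared by both robots."""
    MOVE_FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    STOP = 3

@dataclass
class Observation:
    """Per-robot observation O_t = (I_t, Phi_t, P_t) described above."""
    rgb: np.ndarray       # (128, 128, 3) egocentric RGB image
    depth: np.ndarray     # (128, 128, 1) egocentric depth image
    azimuth: np.ndarray   # (2,)  -> (heading azimuth, time step t)
    position: np.ndarray  # (3,)  -> (x, y, z) position in the scene
```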

3.1 Problem Formulation

We denote agent 0 and agent 1 with superscripts $\omega$ and $\nu$, respectively. The game $\mathcal{M}=(\mathcal{S},(\mathcal{A}^{\omega},\mathcal{A}^{\nu}),\mathcal{P},(\mathcal{R}^{\omega},\mathcal{R}^{\nu}))$ consists of a state set $\mathcal{S}$, action sets $(\mathcal{A}^{\omega},\mathcal{A}^{\nu})$, a joint state transition function $\mathcal{P}:\mathcal{S}\times\mathcal{A}^{\omega}\times\mathcal{A}^{\nu}\rightarrow\mathcal{S}$, and reward functions $\mathcal{R}^{\omega}:\mathcal{S}\times\mathcal{A}^{\omega}\times\mathcal{A}^{\nu}\times\mathcal{S}\rightarrow\mathbb{R}$ for agent 0 and $\mathcal{R}^{\nu}:\mathcal{S}\times\mathcal{A}^{\omega}\times\mathcal{A}^{\nu}\times\mathcal{S}\rightarrow\mathbb{R}$ for agent 1. Each player wishes to maximize their discounted sum of rewards. $r$ is the reward given by the environment at every time step in an episode. MACMA is modeled as a multi-agent Sunehag et al. (2018); Rashid et al. (2018) problem involving two collaborating players sharing the same goal:

\min.\ \mathcal{L} \quad \text{s.t.} \quad \pi^{\star}=(\pi^{\star,\omega},\pi^{\star,\nu})=\mathop{\arg\max}_{\pi^{\omega}\in\Pi^{\omega},\,\pi^{\nu}\in\Pi^{\nu}} G(\pi^{\omega},\pi^{\nu},r),   (1)
where G(\pi^{\omega},\pi^{\nu},r)=w^{\omega}G(\pi^{\omega},r)+w^{\nu}G(\pi^{\nu},r),
G(\pi^{\omega},r)=\sum_{t=0}^{T-1}\gamma^{t}r_{t}\rho^{\omega}, \quad G(\pi^{\nu},r)=\sum_{t=0}^{T-1}\gamma^{t}r_{t}\rho^{\nu},
\rho^{\omega}=(1-\rho)/2, \quad \rho^{\nu}=(1-\rho)/2,
w^{\omega}>0, \quad w^{\nu}>0, \quad 0\leq\rho\leq 1 \;\text{or}\; \rho=-1.0,

where the loss $\mathcal{L}$ is defined in Equation 5. $G(\pi^{\omega},\pi^{\nu},r)$ is the expected joint reward of agent 0 and agent 1 as a whole. $G(\pi^{\omega},r)$ and $G(\pi^{\nu},r)$ are the discounted cumulative rewards for agent 0 and agent 1, respectively. $w^{\omega}$ and $w^{\nu}$ denote the constant cumulative-reward balance factors for agent 0 and agent 1, respectively. $\rho^{\omega}$ and $\rho^{\nu}$ are the immediate-reward contributions of agent 0 and agent 1, respectively. $\rho$ is a reward allocation parameter that is constant throughout training. Inspired by Value Decomposition Networks (VDNs) Sunehag et al. (2018) and QMIX Rashid et al. (2018), we construct the objective function $G(\pi^{\omega},\pi^{\nu},r)$ in Equation 1 by combining $G(\pi^{\omega},r)$ and $G(\pi^{\nu},r)$ under a non-negative partial-derivative constraint (see theoretical details in Appx. A.1).

Agent 0, agent 1 and their optimization. Agent 0 and agent 1 receive the current observations $O^{\omega}_{t}=(I^{\omega}_{t},\Phi^{\omega}_{t},P^{\omega}_{t})$ and $O^{\nu}_{t}=(I^{\nu}_{t},\Phi^{\nu}_{t},P^{\nu}_{t})$ at the $t$-th step. The visual parts ($I^{\omega}_{t}$ and $I^{\nu}_{t}$) are encoded into visual feature vectors $f_{t}^{\omega,i}$ and $f_{t}^{\nu,i}$ using CNN encoders ($E_{\omega}$ for agent 0 and $E_{\nu}$ for agent 1). The visual CNN encoders $E_{\omega}$ and $E_{\nu}$ are constructed in the same way (from input to output layer): Conv8x8, Conv4x4, Conv3x3 and a 256-dim linear layer; ReLU activations are added between any two neighboring layers. $P^{\omega}_{t}$ and $P^{\nu}_{t}$ are embedded by an embedding layer into feature vectors $f_{t}^{\omega,p}$ and $f_{t}^{\nu,p}$, respectively. Then, we concatenate these vectors together with $f_{t}^{\omega,a}$ (from $\Phi^{\omega}_{t}$) and $f_{t}^{\nu,a}$ (from $\Phi^{\nu}_{t}$) to obtain the global observation embeddings $e^{\omega}_{t}=[f_{t}^{\omega,i},f_{t}^{\omega,a},f_{t}^{\omega,p}]$ and $e^{\nu}_{t}=[f_{t}^{\nu,i},f_{t}^{\nu,a},f_{t}^{\nu,p}]$. We transform the observation embeddings into state representations using a gated recurrent unit (GRU), $s^{\omega}_{t}=\textsc{GRU}(e^{\omega}_{t},h^{1}_{t-1})$; we adopt a similar procedure to obtain $s^{\nu}_{t}=\textsc{GRU}(e^{\nu}_{t},h^{2}_{t-1})$. The state vectors ($s^{\omega}_{t}$ for agent 0 and $s^{\nu}_{t}$ for agent 1) are then fed to an actor-critic network to 1) predict the conditioned action probability distributions $\pi_{\theta_{1}^{\omega}}(a^{\omega}_{t}|s^{\omega}_{t})$ for agent 0 and $\pi_{\theta_{1}^{\nu}}(a^{\nu}_{t}|s^{\nu}_{t})$ for agent 1, and 2) estimate the state values $V_{\theta_{2}^{\omega}}(s^{\omega}_{t},r^{\omega}_{t})$ for agent 0 and $V_{\theta_{2}^{\nu}}(s^{\nu}_{t},r^{\nu}_{t})$ for agent 1. The actor and critic are each implemented with a single linear layer, parameterized by $\theta_{1}^{\omega}$, $\theta_{1}^{\nu}$, $\theta_{2}^{\omega}$, and $\theta_{2}^{\nu}$, respectively. For conciseness, we use $\boldsymbol{\theta}$ to denote the compound of $\theta_{1}^{\omega}$, $\theta_{1}^{\nu}$, $\theta_{2}^{\omega}$, and $\theta_{2}^{\nu}$ hereafter. The action samplers in Figure 2 sample the actual actions to execute ($a_{t}^{\omega}$ for agent 0 and $a_{t}^{\nu}$ for agent 1) from $\pi_{\theta_{1}^{\omega}}(a^{\omega}_{t}|s^{\omega}_{t})$ and $\pi_{\theta_{1}^{\nu}}(a^{\nu}_{t}|s^{\nu}_{t})$, respectively. Both agents optimize their policies by maximizing the expected discounted cumulative rewards $G(\pi^{\omega},r)$ and $G(\pi^{\nu},r)$, respectively. The Critic module evaluates the actions taken by agent 0 and agent 1 to guide them toward improved actions at the next time step. The loss $\mathcal{L}^{m}$ is formulated as Equation 2.
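A minimal PyTorch sketch of one agent's encoder and actor-critic heads is given below. The Conv8x8/Conv4x4/Conv3x3 stack, the 256-dim linear layer, the GRU state and the single-linear-layer actor and critic follow the description above; the strides, channel counts, embedding sizes and GRU hidden size are assumptions.

```python
import torch
import torch.nn as nn

class AgentEncoder(nn.Module):
    """Observation encoder sketch for one agent (E_omega or E_nu)."""
    def __init__(self, in_channels=4, hidden=512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(256),
        )
        self.azi_embed = nn.Linear(2, 32)   # Phi_t = (azimuth, t) -> f^a
        self.pos_embed = nn.Linear(3, 32)   # P_t   = (x, y, z)    -> f^p
        self.gru = nn.GRUCell(256 + 32 + 32, hidden)

    def forward(self, rgbd, azimuth, position, h_prev):
        f_i = self.cnn(rgbd)                                  # visual feature f^i
        e_t = torch.cat([f_i, self.azi_embed(azimuth),
                         self.pos_embed(position)], dim=-1)   # observation embedding e_t
        return self.gru(e_t, h_prev)                          # state representation s_t

class ActorCritic(nn.Module):
    """Single-linear-layer actor and critic heads, as described in the text."""
    def __init__(self, hidden=512, num_actions=4):
        super().__init__()
        self.actor = nn.Linear(hidden, num_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, s_t):
        return self.actor(s_t), self.critic(s_t)              # pi(a|s) logits, V(s)
```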

\mathcal{L}^{m}=w^{\omega}_{m}\cdot\mathcal{L}^{\omega}_{m}+w^{\nu}_{m}\cdot\mathcal{L}^{\nu}_{m},   (2)

where $\mathcal{L}^{\omega}_{m}$ and $\mathcal{L}^{\nu}_{m}$ are the motion losses for agent 0 and agent 1, respectively, and $w^{\omega}_{m},w^{\nu}_{m}$ are hyperparameters. The loss $\mathcal{L}^{j}_{m}$ is defined as

\mathcal{L}^{j}_{m}=\sum 0.5\left(\hat{V}_{\theta^{j}}(s)-V^{j}(s)\right)^{2}-\sum\left[\hat{A}^{j}\log(\pi_{\theta^{j}}(a\mid s))+\beta\cdot H(\pi_{\theta^{j}}(a\mid s))\right],   (3)

where $j\in\{\omega,\nu\}$, and $\hat{V}_{\theta^{j}}(s)$ denotes the estimated state value of the target network for $j$. $V^{j}(s)=\max_{a\in\mathbb{A}^{j}}\mathbb{E}[r_{t}+\gamma\cdot V^{j}(s_{t+1})\mid s_{t}=s]$. The advantage for a given length-$T$ trajectory is $\hat{A}_{t}^{j}=\sum_{i=t}^{T-1}\gamma^{i+2-t}\cdot\delta_{i}^{j}$, where $\delta_{t}^{j}=r_{t}+\gamma\cdot V^{j}(s_{t+1})-V^{j}(s_{t})$. $H(\pi_{\theta^{j}}(a\mid s))$ is the entropy of $\pi_{\theta^{j}}(a\mid s)$. We collectively denote all the weights in Figure 2 except the above actor-critic networks for agent 0 and agent 1 as $\boldsymbol{\Omega}$ hereafter for simplicity.
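As a concrete reference, the advantage estimate used in the motion loss can be computed as in the following sketch; the exponent $\gamma^{i+2-t}$ is reproduced exactly as written above, and the bootstrapped value list is an implementation assumption.

```python
import numpy as np

def advantage_estimates(rewards, values, gamma=0.99):
    """Advantage computation sketch for one agent's length-T trajectory.

    delta_i = r_i + gamma * V(s_{i+1}) - V(s_i) and
    A_t = sum_{i=t}^{T-1} gamma^(i+2-t) * delta_i, reproducing the exponent
    as written in the text. `values` has length T + 1 so the final V(s_T)
    bootstraps the last step.
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    return np.array([
        sum(gamma ** (i + 2 - t) * deltas[i] for i in range(t, T))
        for t in range(T)
    ])
```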

RIR measurement and its regression. We encode the observations $O^{\omega}_{t}$ and $O^{\nu}_{t}$ with encoder $E_{r}$, whose output is $f_{r}$. The historical observations $\{O^{\omega}_{t+1-\kappa},O^{\nu}_{t+1-\kappa},\cdots,O^{\omega}_{t-1},O^{\nu}_{t-1},O^{\omega}_{t},O^{\nu}_{t}\}$ are stored in the memory and encoded by $E_{m}$, which outputs $f_{m}$, where $\kappa$ is the length of the memory bank. Then, $f_{r}$ and $f_{m}$ are concatenated, and the predicted RIR $\hat{W}_{t}$ is obtained using the RIR generator $D_{r}$. For more details on the structure of $E_{r}$, $E_{m}$ and $D_{r}$, please refer to Appx. A.2. The RIR measurement is learned with the ground-truth RIR $W_{t}$. Let $\mathcal{L}^{\xi}$ denote the loss of the RIR measurement; it is formulated as

\mathcal{L}^{\xi}=(1-w^{\text{MSE}})\cdot 10\cdot\mathcal{L}^{\text{STFT}}+w^{\text{MSE}}\cdot 4464.2\cdot\mathcal{L}^{\text{MSE}},   (4)
\mathcal{L}^{\text{STFT}}=\sum\Delta(W_{t},\hat{W}_{t}), \quad \mathcal{L}^{\text{MSE}}=\sum\text{MSE}(W_{t},\hat{W}_{t}),

where $\hat{W}_{t}$ is the RIR predicted by the RIR Measurement module and $W_{t}$ is the ground-truth RIR. $\Delta(W_{t},\hat{W}_{t})$ is the STFT (short-time Fourier transform) distance, calculated by Equation 8. The factors 10 and 4464.2 are experimental parameters obtained from grid search.

The total evaluation. The Critic module is implemented with a linear layer. The total loss of our model is formulated as Equation 5. We minimize $\mathcal{L}$ following Proximal Policy Optimization (PPO) Schulman et al. (2017).

\mathcal{L}=w_{m}\cdot\mathcal{L}^{m}+w_{\xi}\cdot\mathcal{L}^{\xi},   (5)

where $\mathcal{L}^{m}$ is the motion loss component for the two agents, $\mathcal{L}^{\xi}$ is the room-impulse-response prediction loss component, and $w_{m}$ and $w_{\xi}$ are hyperparameters. The losses $\mathcal{L}^{m}$ and $\mathcal{L}^{\xi}$ are formulated in Equation 2 and Equation 4, respectively.
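For clarity, a minimal sketch combining Equations 2, 4 and 5 is given below; the default weight values are placeholders rather than the authors' tuned settings (only $w^{\text{MSE}}=1.0$ is reported explicitly, in Appx. A.4).

```python
def total_loss(l_m_omega, l_m_nu, l_stft, l_mse,
               w_m=1.0, w_xi=1.0, w_m_omega=1.0, w_m_nu=1.0, w_mse=1.0):
    """Combine the loss terms of Equations 2, 4 and 5 (sketch).

    The constants 10 and 4464.2 come from the paper's grid search; the default
    weights here are placeholders, not the reported hyperparameter values.
    """
    l_m = w_m_omega * l_m_omega + w_m_nu * l_m_nu                  # Equation 2
    l_xi = (1.0 - w_mse) * 10.0 * l_stft + w_mse * 4464.2 * l_mse  # Equation 4
    return w_m * l_m + w_xi * l_xi                                 # Equation 5
```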

The design of environmental reward.

Figure 3: Demonstration of the current (A and B) and previous (C and D) positions of the two robots. The quadrilateral formed by these four coplanar points is denoted as $\Gamma_{ABCD}$.

As shown in Figure 3, $\Gamma_{ABCD}$ is formed by the positions of the two agents at the current and previous time steps. $r_{t}$ is the reward of the current step, calculated by Equation 6.

r_{t}=r^{\xi}_{t}+r^{\zeta}_{t}+r^{\psi}_{t}+r^{\phi}_{t},   (6)

Here $\xi$ denotes the room-impulse-response prediction, $\zeta$ denotes the coverage rate, $\psi$ denotes the perimeter of the convex hull, and $\phi$ denotes the area of the convex hull.

Among them, $r^{\xi}_{t}$ is the reward component for measurement accuracy, which evaluates the improvement of the measurement accuracy at the current step over that of the previous step. $r^{\xi}_{t}$ is calculated by

r^{\xi}_{t}=\alpha^{\xi}\cdot(\xi_{t}-\xi_{t-1}), \quad \xi_{t}=-\Delta(W_{t},\hat{W}_{t}),   (7)

where $\xi_{t}$ is the measurement accuracy at the current step. As briefly explained before, $\Delta(W_{t},\hat{W}_{t})$ is the STFT distance, which can be calculated by

\Delta(W_{t},\hat{W}_{t})=0.5\cdot\Theta(z,\hat{z})+0.5\cdot\Xi(z,\hat{z}),   (8)

where $z$ is the magnitude spectrogram of the ground-truth RIR $W_{t}$ at the current time step, and $\hat{z}$ is the corresponding predicted variant. $\Theta(z,\hat{z})$ is the average spectral-convergence loss for $z$ and $\hat{z}$, and $\Xi(z,\hat{z})$ is the log STFT-magnitude loss. $\Theta(z,\hat{z})$ and $\Xi(z,\hat{z})$ are computed as

\Theta(z,\hat{z})=\frac{\lVert z-\hat{z}\rVert_{F}}{\lVert z\rVert_{F}} \quad\text{and}\quad \Xi(z,\hat{z})=\sum\left|\log\frac{z}{\hat{z}}\right|,   (9)

where $\lVert\cdot\rVert_{F}$ is the Frobenius norm. $z=\varLambda(W_{t})=\sqrt{y_{r}^{2}+y_{i}^{2}}$, where $y_{r}$ and $y_{i}$ are the real and imaginary parts of the STFT of $W_{t}$ (STFT parameters: #FFT = 1024, #shift = 120, #window = 600, Hamming window). $\hat{z}=\varLambda(\hat{W}_{t})$ is defined analogously to $z$, and both are computed in the same way.
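A minimal PyTorch sketch of this STFT distance, using the parameters from the footnote, is shown below; the small magnitude clamp that keeps the division and logarithm finite is an added assumption.

```python
import torch

def stft_magnitude(w, n_fft=1024, hop=120, win_len=600):
    """Magnitude spectrogram z = |STFT(w)| with the parameters given above."""
    window = torch.hamming_window(win_len)
    spec = torch.stft(w, n_fft=n_fft, hop_length=hop, win_length=win_len,
                      window=window, return_complex=True)
    return spec.abs().clamp_min(1e-7)   # clamp added to keep log/division finite

def stft_distance(w_true, w_pred):
    """Delta(W_t, W_hat_t) = 0.5 * Theta(z, z_hat) + 0.5 * Xi(z, z_hat).

    Theta is the spectral-convergence term and Xi the (summed) log-magnitude
    term of Equations 8-9. Inputs are waveforms of shape (channels, samples)
    or (samples,).
    """
    z, z_hat = stft_magnitude(w_true), stft_magnitude(w_pred)
    theta = torch.norm(z - z_hat) / torch.norm(z)   # Frobenius norm ratio
    xi = torch.log(z / z_hat).abs().sum()           # log STFT magnitude loss
    return 0.5 * theta + 0.5 * xi
```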

$\zeta_{t}$ is the coverage at the current step, i.e., the ratio of visited nodes (each node counted only once, even if revisited) to all nodes in the scene at time step $t$. We calculate $r^{\zeta}_{t}$ by

r^{\zeta}_{t}=\alpha^{\zeta}\cdot(\zeta_{t}-\zeta_{t-1}).   (10)

$\psi_{t}$ and $\phi_{t}$ are respectively the perimeter and area of $\Gamma_{ABCD}$ in Figure 3 at time step $t$. We calculate $r^{\psi}_{t}$ and $r^{\phi}_{t}$ with

r^{\psi}_{t}=\alpha^{\psi}\cdot(\psi_{t}-\psi_{t-1}) \quad\text{and}\quad r^{\phi}_{t}=\alpha^{\phi}\cdot(\phi_{t}-\phi_{t-1}),   (11)

where $\alpha^{\xi}=1.0$, $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$ and $\alpha^{\phi}=1.0$ are hyperparameters (see Appx. A.9).
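Putting Equations 6-11 together, a minimal sketch of the per-step reward is given below; computing the perimeter and area of $\Gamma_{ABCD}$ with scipy's ConvexHull over 2D positions is an implementation assumption.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_perimeter_area(points):
    """Perimeter and area of the convex hull of 2D points (Gamma_ABCD).

    For a 2D hull, scipy reports the perimeter as `area` and the enclosed
    area as `volume`; the four points are assumed not to be collinear.
    """
    hull = ConvexHull(np.asarray(points, dtype=float))
    return hull.area, hull.volume

def step_reward(pe_prev, pe_curr, cov_prev, cov_curr, gamma_prev, gamma_curr,
                alpha_xi=1.0, alpha_zeta=1.0, alpha_psi=-1.0, alpha_phi=1.0):
    """Environment reward r_t of Equations 6-11 (sketch).

    pe_* are STFT prediction errors, cov_* coverage ratios, and gamma_* the
    four 2D agent positions forming Gamma_ABCD at steps t-1 and t. Treating
    positions as 2D (x, z) coordinates is an assumption.
    """
    r_xi = alpha_xi * ((-pe_curr) - (-pe_prev))   # accuracy improvement (Eq. 7)
    r_zeta = alpha_zeta * (cov_curr - cov_prev)   # coverage improvement (Eq. 10)
    psi_prev, phi_prev = hull_perimeter_area(gamma_prev)
    psi_curr, phi_curr = hull_perimeter_area(gamma_curr)
    r_psi = alpha_psi * (psi_curr - psi_prev)     # perimeter term (Eq. 11)
    r_phi = alpha_phi * (phi_curr - phi_prev)     # area term (Eq. 11)
    return r_xi + r_zeta + r_psi + r_phi          # Eq. 6
```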

Overall algorithm. The entire procedure of MACMA is presented as pseudo-code in Algorithm 1.

Algorithm 1 MACMA (Measuring Acoustics with Collaborative Multiple Agents)

Input: Environment $\mathcal{E}$, number of updates $M$, number of episodes $N$, max time steps $T$.
Parameters: Stochastic policies $\pi$, initial actor-critic weights $\boldsymbol{\theta}_{0}$, initial remaining weights $\boldsymbol{\Omega}_{0}$.
Output: Trained weights $\boldsymbol{\theta}_{M}$ and $\boldsymbol{\Omega}_{M}$.

1:  for $i = 1, 2, \ldots, M$ do
2:     // Run policy $\pi_{\boldsymbol{\theta}_{i-1}}$ for $N$ episodes of $T$ time steps
3:     $\{(o_{t,i},\,h_{t-1,i},\,a_{t,i},\,r_{t,i})\}\leftarrow\text{roll}(\mathcal{E},\pi_{\boldsymbol{\theta}_{i-1}},T)$
4:     Compute advantage estimates
5:     RIR prediction and environmental reward assignment
6:     // Optimize w.r.t. $\boldsymbol{\theta}$ and $\boldsymbol{\Omega}$
7:     $\boldsymbol{\theta}_{i},\boldsymbol{\Omega}_{i}\leftarrow$ new $\boldsymbol{\theta}$ and $\boldsymbol{\Omega}$ from the PPO update minimizing Equation 5
8:  end for

4 Experiments

We adopt the commonly used 3D environments collected with the SoundSpaces platform Chen et al. (2020) and the Habitat simulator Savva et al. (2019). They are publicly available as several datasets: Replica Straub et al. (2019), Matterport3D Chang et al. (2017) and SoundSpaces (audio) Chen et al. (2020). Replica contains 18 environments in the form of grids (with a resolution of 0.5 meters) constructed from accurate scans of apartments, offices, and hotels. Matterport3D has 85 scanned grids (1-meter resolution) of indoor environments such as personal homes. To measure the RIR in Replica Straub et al. (2019) and Matterport3D Chang et al. (2017), we let the two agents move a certain number of steps (250 and 300 steps for Replica and Matterport3D, respectively) throughout the scene and plan a measuring path. At every time step, the two agents measure the RIR while moving. The experimental procedure contains several phases: a) we pretrain the generator $D_{r}$ under the setting $\mathcal{L}=\mathcal{L}^{m}$ ($w_{m}=1.0$ and $w_{\xi}=0.0$) with a random policy for both agent 0 and agent 1 on the training split, b) we train and validate every baseline, fine-tuning the generator $D_{r}$ jointly, on the training and validation splits, and c) we test every baseline on the test split. MACMA is benchmarked against several baselines: Random, Nearest neighbor, Occupancy Ramakrishnan et al. (2020) and Curiosity Pathak et al. (2017). Random uniformly samples one of the three movement actions and executes Stop when it reaches the maximum number of steps. Nearest neighbor predicts from the closest experience (Appx. A.3). Occupancy aims to occupy more area, making the area of $\Gamma_{ABCD}$ in Figure 3 larger. Curiosity strives to visit nodes that have not yet been visited in the current episode. To evaluate the different methods, we adopt the evaluation metrics CR (coverage rate), PE (prediction error), WCR (weighted coverage rate), RTE (RT60 error) and SiSDR (scale-invariant signal-to-distortion ratio), among which WCR is the most important, since it trades off high prediction accuracy against wide exploration. CR is the ratio of the number of nodes visited by the agents to the number of nodes in the current episode: $\text{CR}=N_{v}/N_{e}$, where $N_{v}$ is the total number of unique nodes that the two agents have visited together, and $N_{e}$ is the total number of individual nodes in the current episode. $\text{PE}=\Delta(W_{t},\hat{W}_{t})$ (Equation 8), where $W_{t}$ is the ground-truth RIR and $\hat{W}_{t}$ is the predicted RIR. $\text{WCR}=(1.0-\lambda)\cdot\text{CR}+\lambda\cdot(1.0-\text{PES})$, where PES stands for scaled prediction error, $\text{PES}=2/(1+\exp(-\text{PE}))-1.0$, and $0\leq\lambda\leq 1.0$ is a hyper-parameter. RTE is the difference between the ground-truth RT60 value and the predicted one. $\text{SiSDR}=10\log_{10}\lVert X_{T}\rVert^{2}/\lVert X_{E}\rVert^{2}$, where $X_{E}$ is the error vector and $X_{T}$ is the ground-truth vector. We select the hyperparameters by grid search; details are given in Appx. A.4.
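To make the metric definitions concrete, a minimal sketch of WCR/PES and SiSDR is given below; projecting the prediction onto the ground truth inside the SiSDR computation follows the usual scale-invariant convention and is an assumption.

```python
import numpy as np

def wcr(cr, pe, lam=0.1):
    """Weighted coverage rate: (1 - lam) * CR + lam * (1 - PES)."""
    pes = 2.0 / (1.0 + np.exp(-pe)) - 1.0        # scaled prediction error in [0, 1)
    return (1.0 - lam) * cr + lam * (1.0 - pes)

def sisdr(x_true, x_pred):
    """SiSDR = 10 * log10(||X_T||^2 / ||X_E||^2) (sketch)."""
    x_true = np.asarray(x_true, dtype=float).ravel()
    x_pred = np.asarray(x_pred, dtype=float).ravel()
    scale = np.dot(x_pred, x_true) / np.dot(x_true, x_true)  # scale-invariant projection
    target = scale * x_true
    error = x_pred - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(error ** 2))
```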

Model            | Replica                                        | Matterport3D
                 | WCR (↑)  PE (↓)  CR (↑)  RTE (↓)  SiSDR (↑)    | WCR (↑)  PE (↓)  CR (↑)  RTE (↓)  SiSDR (↑)
Random           | 0.3103   5.4925  0.3439  14.7427  20.3534      | 0.2036   5.5552  0.2254  23.5281  12.3042
Nearest neighbor | 0.3444   5.4533  0.3817  14.0269  22.0135      | 0.2099   5.3342  0.2321  28.8765  15.2351
Occupancy        | 0.4464   3.7224  0.4907  12.5532  23.0666      | 0.2225   4.5327  0.2449  20.3399  18.3848
Curiosity        | 0.4327   3.4883  0.4742  10.9565  23.8669      | 0.2111   4.4255  0.2319  29.5572  20.0031
MACMA (Ours)     | 0.6977   3.2509  0.7669  13.8896  23.6501      | 0.3030   4.0113  0.3327  15.9338  21.3187
Table 1: The results of quantitative comparison between our proposed method (MACMA) and baselines.
Figure 4: Visualization of the navigation trajectories by the end of a particular episode from Replica (top row) and Matterport3D (bottom row) dataset. Higher WCR values and bigger “seen” areas (colored in light-grey) indicate better performances.

4.1 Experimental Results

Results are averages of 5 tests with different random seeds.

Quantitative comparison on the two datasets. Results are obtained on the two datasets under the experimental settings $\alpha^{\xi}=1.0$, $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$, $\alpha^{\phi}=1.0$, $\kappa=2$, $\lambda=0.1$, $\rho=-1.0$. As seen from Table 1, on the Replica dataset, MACMA achieves the best results on the WCR, PE and CR metrics, while the Curiosity model performs best on RTE and SiSDR. Curiosity encourages each agent to visit new nodes near itself, which boosts individual exploration (improving RTE and SiSDR) but reduces the team's coverage (lowering CR). The Occupancy model (second on CR) motivates the exploratory ability of the whole group (agent 0 and agent 1 together) but ignores individual exploration (its ranks on RTE and SiSDR are lower than its rank on CR). MACMA combines group and individual exploration abilities and achieves a good trade-off between the two, so its group exploration improves by a large margin (e.g., on the CR metric) and it attains the best WCR. On the Matterport3D dataset, MACMA achieves the best results on all metrics. We conclude that MACMA quantitatively outperforms the baselines on both datasets.

Qualitative comparison on exploration capability. Figure 4 shows the navigation trajectories of agent 0 and agent 1 for different algorithms by the end of a particular episode from the Replica (top row) and Matterport3D (bottom row) dataset. The light-gray areas in Figure 4 indicate the exploration field of the robots. We observe that MACMA tends to explore the most extensively compared to the other baselines. Particularly, there are three rooms in the entire scene in Replica, and MACMA is the only method that managed to traverse all three rooms using the same number of time steps as baselines.

Qualitative comparison on RIR prediction. We show the spectrograms generated by these models and from the ground truth in Figure 5.

Figure 5: Qualitative comparison of RIR prediction (binaural RIR with channel 0 and channel 1) in spectrogram form from the Replica (top row) and Matterport3D (bottom row) datasets. Each column shows the result of one model, except the last column, which shows the ground-truth RIR.

These binaural spectrograms (channel 0 and channel 1) last for one second (the x-axis is time). The spectrograms of the RIRs generated by Random and Occupancy have fewer color blocks than the ground truth between 0.2 s and 0.4 s and more color blocks than the ground truth between 0.8 s and 1 s. The spectrogram of the RIR generated by Nearest neighbor has more colored regions than the ground-truth spectrogram. On the Replica dataset, the spectrogram of the RIR predicted by Curiosity has fewer color blocks than the ground truth between 0.2 s and 0.4 s and more color blocks between 0.8 s and 1 s; on the Matterport3D dataset, it has more colored regions than the ground truth. The spectrogram of the RIR generated by MACMA (Ours) is the closest to the ground-truth spectrogram. In conclusion, from a qualitative, visual point of view, the spectral quality of the RIRs generated by our model is the best. Additionally, in Appx. A.5, we show that the waveform quality of the RIRs generated by our model is also superior.

4.2 Ablation Studies

Ablation on modality. Results are obtained on the Replica dataset under the experimental settings $\alpha^{\xi}=1.0$, $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$, $\alpha^{\phi}=1.0$, $\kappa=2$, $\lambda=0.1$, $\rho=-1.0$. As shown in Table 2, RGBD (vision with RGB images and depth input) appears to be the best choice.

More ablations. We explore the relationship between modality importance, action selection, and RIR measurement accuracy in Appx. A.6 and Appx. A.7. We present the extended model MACMARA (MACMA with a dynamic Reward Assignment module) in Appx. B.5. More ablation studies on the memory size $\kappa$ and the reward components can be found in Appx. A.4.

Vision | WCR (↑) | PE (↓)  | CR (↑)  | RTE (↓)  | SiSDR (↑)
Blind  | 0.5020  | 3.4966  | 0.5512  | 14.2049  | 23.0903
RGB    | 0.5930  | 3.8204  | 0.6541  | 15.5897  | 23.7713
Depth  | 0.5068  | 3.4927  | 0.5566  | 29.6905  | 23.5089
RGBD   | 0.6977  | 3.2509  | 0.7669  | 13.8896  | 23.6501
Table 2: Ablation on modality.

5 Conclusion

In this work, we propose a novel task where two collaborative agents learn to measure the room impulse responses of an environment by moving and emitting/receiving signals within a given time budget. To tackle this task, we design a collaborative navigation and exploration policy. Our approach outperforms several baselines in terms of environment coverage and prediction error. A known limitation is that we only explored the most basic setting, one listener (receiver) and one source (emitter), and did not study settings with two or more listeners or sources. Another limitation is that our current assessments are conducted in virtual environments; evaluating our method on real-world cases, such as robots moving in a real house and learning to measure environment acoustics collaboratively, would be more meaningful. Lastly, we have not considered semantic information about the scene in policy learning; incorporating such information would be a valuable extension. These three directions are left for future exploration.

Ethical Statement

This research follows IJCAI’s ethics guidelines and does not involve human subjects or privacy issues with the dataset. The dataset is publicly available.

Acknowledgments

We thank the IJCAI 2023 reviewers for their insightful feedback. This work is supported by multiple projects, including the Sino-German Collaborative Research Project Crossmodal Learning (NSFC62061136001/DFG SFB/TRR169), the National Natural Science Foundation of China (No. 61961039), the National Key R&D Program of China (No. 2022ZD0115803), the Xinjiang Natural Science Foundation (No. 2022D01C432, No. 2022D01C58, No. 2020D01C026, and No. 2015211C288), and partial funding from the THU-Bosch JCML Center.

APPENDIX

We provide additional details for MACMA and its extended version MACMARA (Measuring Acoustics with Collaborative Multiple Agents with a Reward Assignment module).

A:   MACMA

A.1:   The theoretical point of view.

A.2:   Implementation of ErE_{r}, EmE_{m} and DrD_{r} in details.

A.3:   Implementation of nearest neighbor in detail.

A.4:   Hyper-parameter selection experiments.

A.5:   More experiments for MACMA.

A.6:   On modality importance for action selection.

A.7:   On modality importance for the RIR measurement.

A.8:   Visualization of agents’ state features.

A.9:   Algorithm parameters.

B:   MACMARA

B.1:   The overview of MACMARA.

B.2:   Reward assignment for MACMARA.

B.3:   The loss of MACMARA.

B.4:   Formulation of MACMARA.

B.5:   Experimental results of MACMARA.

B.6:   Algorithm parameters in MACMARA.

Appendix A MACMA

A.1 The Theoretical Point of View

Below we provide some theoretical points of view for the problem formulation.

Monotonic value function factorization theory. Monotonic value function factorization for deep multi-agent reinforcement learning is proposed in QMIX Rashid et al. (2018). This theorem is cited as follows.

Lemma 1.

If $\forall i\in\mathbf{N}\equiv\{1,2,\ldots,n\},\ \frac{\partial Q_{tot}}{\partial Q_{i}}\geq 0$, then

\mathop{\arg\max}_{\mathbf{u}}Q_{tot}(\boldsymbol{\tau},\mathbf{u})=\left(\begin{array}{c}\mathop{\arg\max}_{u^{1}}Q_{1}(\tau^{1},u^{1})\\ \vdots\\ \mathop{\arg\max}_{u^{n}}Q_{n}(\tau^{n},u^{n})\end{array}\right),   (12)

where $Q_{tot}(\boldsymbol{\tau},\mathbf{u})$ is a joint action-value function, $\boldsymbol{\tau}$ is a history of joint actions, and $\mathbf{u}$ is a joint action.

The proof of this theorem is provided as:

Proof.

Since $\frac{\partial Q_{tot}}{\partial Q_{i}}\geq 0$ for all $i\in\mathbf{N}$, the following holds for any $(u^{1},\ldots,u^{n})$ and the mixing network function $Q_{tot}(\cdot)$ with $n$ arguments:

Q_{tot}\left(Q_{1}(\tau^{1},u^{1}),\ldots,Q_{i}(\tau^{i},u^{i}),\ldots,Q_{n}(\tau^{n},u^{n})\right)
\leq Q_{tot}\left(\max_{u^{1}}Q_{1}(\tau^{1},u^{1}),\ldots,Q_{i}(\tau^{i},u^{i}),\ldots,Q_{n}(\tau^{n},u^{n})\right)
\leq\cdots
\leq Q_{tot}\left(\max_{u^{1}}Q_{1}(\tau^{1},u^{1}),\ldots,\max_{u^{i}}Q_{i}(\tau^{i},u^{i}),\ldots,\max_{u^{n}}Q_{n}(\tau^{n},u^{n})\right).   (13)

Therefore, the maximum of the mixing network function is attained at
\left(\max_{u^{1}}Q_{1}(\tau^{1},u^{1}),\ldots,\max_{u^{i}}Q_{i}(\tau^{i},u^{i}),\ldots,\max_{u^{n}}Q_{n}(\tau^{n},u^{n})\right). Thus,

\max_{\mathbf{u}}Q_{tot}(\boldsymbol{\tau},\mathbf{u}):=\max_{\mathbf{u}=(u^{1},\ldots,u^{n})}Q_{tot}\left(Q_{1}(\tau^{1},u^{1}),\ldots,Q_{n}(\tau^{n},u^{n})\right)=Q_{tot}\left(\max_{u^{1}}Q_{1}(\tau^{1},u^{1}),\ldots,\max_{u^{n}}Q_{n}(\tau^{n},u^{n})\right).   (14)

Letting

\mathbf{u}_{\star}=\left(u_{\star}^{1},\ldots,u_{\star}^{n}\right)=\left(\begin{array}{c}\mathop{\arg\max}_{u^{1}}Q_{1}(\tau^{1},u^{1})\\ \vdots\\ \mathop{\arg\max}_{u^{n}}Q_{n}(\tau^{n},u^{n})\end{array}\right),   (15)

we have that

Q_{tot}\left(Q_{1}(\tau^{1},u_{\star}^{1}),\ldots,Q_{n}(\tau^{n},u_{\star}^{n})\right)=Q_{tot}\left(\max_{u^{1}}Q_{1}(\tau^{1},u^{1}),\ldots,\max_{u^{n}}Q_{n}(\tau^{n},u^{n})\right)=\max_{\mathbf{u}}Q_{tot}(\boldsymbol{\tau},\mathbf{u}).   (16)

Hence, $\mathbf{u}_{\star}=\mathop{\arg\max}_{\mathbf{u}}Q_{tot}(\boldsymbol{\tau},\mathbf{u})$, which proves Equation 12. ∎

The property of the decomposition theorem of the monotone value function. According to Lemma 1, we can get the following property:

Property 1.

If $\forall i\in\{\omega,\nu\},\ \frac{\partial G(\pi^{\omega},\pi^{\nu},r)}{\partial G(\pi^{i},r)}\geq 0$, then Equation 1 has an optimal solution.

Construction of the objective function of the proposed problem. Inspired by Value Decomposition Networks (VDNs) Sunehag et al. (2018) and QMIX Rashid et al. (2018), we construct the objective function $G(\pi^{\omega},\pi^{\nu},r)$ in Equation 1 by combining $G(\pi^{\omega},r)$ and $G(\pi^{\nu},r)$ under the non-negative partial-derivative constraint, according to Property 2.

Property 2.

Since $w^{\omega}=\frac{\partial G(\pi^{\omega},\pi^{\nu},r)}{\partial G(\pi^{\omega},r)}>0$ and $w^{\nu}=\frac{\partial G(\pi^{\omega},\pi^{\nu},r)}{\partial G(\pi^{\nu},r)}>0$ satisfy the conditions of Property 1, Equation 1 has an optimal solution.

A.2 Implementations of ErE_{r}, EmE_{m} and DrD_{r} in Details

Here we provide the structure of $E_{r}$, $E_{m}$ and $D_{r}$ in detail.

The structure of $E_{r}$.

Figure 6: The structure of $E_{r}$.

As shown in Figure 6, at time step $t$, the visual part $I^{r}_{t}$ is encoded into the visual feature vector $f^{r,i}_{t}$ using a CNN encoder. The visual CNN encoder $E^{i}_{r}$ is constructed as follows (from input to output layer): Conv8x8, Conv4x4, Conv3x3 and a 256-dim linear layer; ReLU activations are added between any two neighboring layers. $P^{r}_{t}$ is embedded by an embedding layer into the feature vector $f_{t}^{r,p}$. Then, we concatenate the three vectors $f_{t}^{r,i}$, $f_{t}^{r,a}$ and $f_{t}^{r,p}$ to obtain the embedding vector $f_{r}$.

The structure of $E_{m}$.

Figure 7: The structure of $E_{m}$.

As shown in Figure 7, at time step $t$, the visual part $I^{m}_{t}$ is encoded into the visual feature vector $f^{m,i}_{t}$ using a CNN encoder. The visual CNN encoder $E^{i}_{m}$ is constructed as follows (from input to output layer): Conv8x8, Conv4x4, Conv3x3 and a 256-dim linear layer; ReLU activations are added between any two neighboring layers. $P^{m}_{t}$ is embedded by an embedding layer into the feature vector $f_{t}^{m,p}$. Then, we concatenate the three vectors $f_{t}^{m,i}$, $f_{t}^{m,a}$ and $f_{t}^{m,p}$ to obtain the embedding vector $f_{m}$.

The structure of $D_{r}$. As shown in Figure 8, the input of $D_{r}$ is a 512-d latent vector. $D_{r}$ contains several upsampling convolution blocks. A leaky rectified linear unit (LReLU) is used after each convolutional layer in $D_{r}$, with the final layer of the generator using a sigmoid activation. PN denotes pixel-wise normalization, which we use in the generator. The composition of the blocks is based on ProGAN Karras et al. (2018). The final step in $D_{r}$ is a reshape and extraction that makes the output shape $2\times 16000$.

Figure 8: Detailed overview of the architecture of the generator $D_{r}$.
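As a rough PyTorch sketch of this generator: the 512-d latent input, LReLU activations, pixel norm, final sigmoid and the $2\times 16000$ output follow the description above, while the number of blocks, channel widths and the ×4 nearest-neighbour upsampling schedule are assumptions, since only the overall block composition is described.

```python
import torch
import torch.nn as nn

class PixelNorm(nn.Module):
    """Pixel-wise feature normalization (PN), as used in ProGAN."""
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=1, keepdim=True) + 1e-8)

class RIRGenerator(nn.Module):
    """Sketch of the RIR generator D_r (assumed block layout)."""
    def __init__(self, latent_dim=512, channels=(256, 128, 64, 32, 16, 8, 8)):
        super().__init__()
        blocks, in_ch = [], latent_dim
        for out_ch in channels:                 # 7 blocks of x4 -> 4**7 = 16384 samples
            blocks += [nn.Upsample(scale_factor=4, mode="nearest"),
                       nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.LeakyReLU(0.2), PixelNorm()]
            in_ch = out_ch
        self.blocks = nn.Sequential(*blocks)
        self.to_rir = nn.Sequential(nn.Conv1d(in_ch, 2, kernel_size=3, padding=1),
                                    nn.Sigmoid())

    def forward(self, latent):                  # latent: (B, 512)
        x = self.blocks(latent.unsqueeze(-1))   # (B, 512, 1) -> (B, 8, 16384)
        return self.to_rir(x)[..., :16000]      # crop: the "extraction" to 2 x 16000
```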

A.3 Implementation Details of Nearest Neighbor

After pretraining the generator $D_{r}$, we fix our model and generate all latent vectors on the training and validation splits of a given dataset. We store each latent vector as a record ⟨scene, scene_id, $f^{i}_{train}$, listener's azimuth, listener's position index, source position index⟩ (the $i$-th latent vector in the training split). We then run the fixed pretrained model on the test split of the dataset, where the latent vector $f^{j}_{test}$ (the $j$-th latent vector in the test split) is generated by the encoders $E_{r}$ and $E_{m}$. We calculate the similarity $s_{j,i}$ via the KL (Kullback-Leibler) divergence between $f^{j}_{test}$ and $f^{i}_{train}$: $s_{j,i}=-D_{KL}(f^{j}_{test}\,||\,f^{i}_{train})$. We obtain the nearest neighbor by:

i^{\star}=\mathop{\arg\max}_{i\in\text{train}}s_{j,i}.   (17)

We then retrieve the record of the nearest neighbor, ⟨scene, scene_id, $f^{i^{\star}}_{train}$, listener's azimuth, listener's position index, source position index⟩. According to this record, the nearest-neighbor model obtains the RIR waveform by querying the training split of the given dataset.
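A minimal sketch of this retrieval step is shown below; treating the latent vectors as softmax-normalized distributions before computing the KL divergence is an assumption, since the text only states that KL divergence is used.

```python
import numpy as np

def kl_similarity(f_test, f_train, eps=1e-8):
    """s_{j,i} = -D_KL(f_test || f_train) between two latent vectors."""
    p = np.exp(f_test - np.max(f_test)); p /= p.sum()    # softmax normalization (assumption)
    q = np.exp(f_train - np.max(f_train)); q /= q.sum()
    return -float(np.sum(p * np.log((p + eps) / (q + eps))))

def nearest_neighbor(f_test, train_records):
    """Return the training record whose latent is closest to f_test.

    train_records is a list of (record, f_train) pairs, where `record` holds
    <scene, scene_id, azimuth, listener position index, source position index>.
    """
    scores = [kl_similarity(f_test, f_train) for _, f_train in train_records]
    return train_records[int(np.argmax(scores))][0]
```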

A.4 Hyper-Parameter Selection Experiments for MACMA

The choice of parameter $w^{mse}$. $w^{mse}$ denotes the weight of the MSE loss relative to the STFT loss (see Equation 4). Results (STDEV ≤ 0.01) are averaged over 5 test runs on the Replica dataset during pre-training. Table 3 shows that a larger $w^{mse}$ is helpful for WCR and SiSDR but not for PE and RTE, so $w^{mse}=1.0$ is chosen as a trade-off.

w^mse | WCR (↑) | PE (↓)  | RTE (↓)  | SiSDR (↑)
0.0   | 0.1167  | 2.9037  | 13.6745  | -21.7048
0.5   | 0.1602  | 3.1618  | 13.6742  | -15.9758
1.0   | 0.3037  | 3.1188  | 13.8045  | 24.2508
Table 3: The choice of parameter $w^{mse}$.

The choice of parameter $\lambda$. $\lambda$ is a parameter of the WCR metric. Results (STDEV ≤ 0.01) are averaged over 5 test runs on the Replica dataset under the following experimental settings: the model is Curiosity, $\alpha^{\xi}=1.0$, $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=0.0$, $\alpha^{\phi}=0.0$, $\kappa=2$, $\rho=-1.0$.

It can be seen from Table 4 that $\lambda=0.1$ is a good choice.

λ   | WCR (↑) | PE (↓)  | CR (↑)  | RTE (↓)  | SiSDR (↑)
0.1 | 0.4327  | 3.4883  | 0.4742  | 10.9565  | 23.8669
0.5 | 0.2364  | 3.7721  | 0.4278  | 10.9687  | 22.9011
0.9 | 0.2653  | 3.5234  | 0.4732  | 15.9320  | 24.1206
Table 4: The choice of parameter $\lambda$.

The choice of parameter $\alpha^{\psi}$. $\alpha^{\psi}$ is a parameter of the environmental reward (see Equation 11). Results (STDEV ≤ 0.01) are averaged over 5 test runs on the Replica dataset under the following experimental settings: the model is Curiosity, $\alpha^{\xi}=1.0$, $\alpha^{\zeta}=1.0$, $\alpha^{\phi}=0.0$, $\kappa=2$, $\rho=-1.0$, $\lambda=0.1$. It can be seen from Table 5 that $\alpha^{\psi}=-1.0$ is a good choice.

α^ψ  | WCR (↑) | PE (↓)  | CR (↑)  | RTE (↓)  | SiSDR (↑)
1.0  | 0.4906  | 3.4624  | 0.5384  | 12.7661  | 24.0065
-1.0 | 0.5650  | 3.4742  | 0.6211  | 15.0615  | 23.7385
Table 5: The choice of parameter $\alpha^{\psi}$.

Ablation on using $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$, $\alpha^{\phi}=1.0$. This is an ablation study with different reward settings. We consider whether to use each of these reward components via the coefficients $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$ and $\alpha^{\phi}=1.0$, respectively (see Equation 10). Taking $\alpha^{\psi}$ as an example: if the component is used, then $\alpha^{\psi}=-1.0$; otherwise $\alpha^{\psi}=0.0$. Results (STDEV ≤ 0.01) are averaged over 5 test runs on the Replica dataset under the following experimental settings: $\alpha^{\xi}=1.0$, $\kappa=2$, $\rho=-1.0$, $\lambda=0.1$. It can be seen from Table 6 that the best model uses $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$, $\alpha^{\phi}=1.0$.

α^ζ=1.0 | α^ψ=-1.0 | α^ϕ=1.0 | WCR (↑) | PE (↓)  | CR (↑)  | RTE (↓)  | SiSDR (↑)
✓       |          |         | 0.4327  | 3.4883  | 0.4742  | 10.9565  | 23.8669
        | ✓        |         | 0.5215  | 3.7912  | 0.5745  | 14.6120  | 23.1906
        |          | ✓       | 0.4464  | 3.7224  | 0.4907  | 12.5532  | 23.0666
✓       | ✓        |         | 0.5650  | 3.4742  | 0.6211  | 15.0615  | 23.7385
✓       |          | ✓       | 0.4663  | 3.2902  | 0.5101  | 14.7773  | 23.9743
        | ✓        | ✓       | 0.5468  | 3.2901  | 0.5996  | 13.8064  | 23.8852
✓       | ✓        | ✓       | 0.6977  | 3.2509  | 0.7669  | 13.8896  | 23.6501
Table 6: Ablation on using $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$, $\alpha^{\phi}=1.0$.

The ablation on memory size $\kappa$. $\kappa$ denotes the memory size of the memory encoder used in our model. Results (STDEV ≤ 0.01) are averaged over 5 test runs on the Replica dataset under the experimental settings $\alpha^{\xi}=1.0$, $\alpha^{\zeta}=1.0$, $\alpha^{\psi}=-1.0$, $\alpha^{\phi}=1.0$, $\rho=-1.0$, $\lambda=0.1$. Table 7 shows that $\kappa=2$ is the best choice.

κ  | WCR (↑) | PE (↓)  | CR (↑)  | RTE (↓)  | SiSDR (↑)
0  | 0.6006  | 3.1518  | 0.6582  | 15.3408  | 19.1678
2  | 0.6977  | 3.2509  | 0.7669  | 13.8896  | 23.6501
4  | 0.4930  | 3.1751  | 0.5389  | 15.0578  | 22.8031
8  | 0.5006  | 3.7761  | 0.5513  | 22.6044  | 24.9619
16 | 0.5319  | 3.1821  | 0.5822  | 14.1616  | 22.8854
Table 7: Ablation on $\kappa$.

A.5 More Experiments for MACMA

We present some results for MACMA below.

Qualitative comparison of RIR prediction in the waveform. In Figure 9, we present the RIRs generated by these models together with the ground truth.

Figure 9: Qualitative comparison of RIR prediction (Binaural RIR with channel 0 and channel 1) in waveform from Replica (upper two rows) and Matterport3D (lower two rows). The last column is the RIR ground truth.

These binaural RIRs (channels 0 and 1) last for one second (the x-axis is time). The shape of the RIRs predicted by Random is close to the ground truth, but the details are less accurate on both the Replica and Matterport3D datasets. The RIRs predicted by Nearest neighbor for channel 0 are very close to the ground truth, but those for channel 1 are worse on both datasets. Both channels of the RIRs predicted by Occupancy, Curiosity and MACMA are very close to the ground truth; still, the tail parts of the RIRs predicted by Occupancy and Curiosity are worse than those of MACMA on the Replica and Matterport3D datasets. All in all, from a purely visual point of view, the RIR waveforms generated by our model are of the highest quality.

Visualization of learned features and states. In the MACMA framework (cf. Section 3), the encoders $E_{\omega}$, $E_{\nu}$, $E_{m}$, and $E_{r}$ generate the features $f^{\omega,i}_{t}$, $f^{\nu,i}_{t}$, $f_{r}$ and $f_{m}$, respectively. The disentanglement quality of these learned features and states is important for downstream policy learning and RIR generation. In Figure 10, we examine the semantics of the visual features by overlaying the output of the visual encoder (from different layers) over the RGB images. It is easy to see that the visual encoder has learned to pay more attention to the areas (in red) where the robot can walk. This effect becomes more evident as the encoder becomes deeper.

Figure 10: Visualization of the learned features for different encoders: (a) $E_{\omega}$, (b) $E_{\nu}$, (c) $E_{m}$, (d) $E_{r}$.

A.6 On Dynamic Modality Importance for Action Selection

We postulate that at any given time step, the relative effects of vision (v), azimuth (a), and position (p) on an agent's decisions may vary. To test this hypothesis, we 1) replace one of the three modalities (e.g., v) with random noise, 2) feed both the semi-corrupted input and the intact input to the trained model, 3) obtain the intervened action distribution and the original action distribution for a specific agent (e.g., agent 0), and 4) compute the change: $d_{o,v}$ denotes the KL divergence between the original action distribution (o) and the action distribution after the visual intervention (v), which indicates the change brought about by the intervention. We then repeat the same process for the other modalities (a, p) in turn to obtain the change under the azimuth intervention, $d_{o,a}$, and the change under the position intervention, $d_{o,p}$. Finally, we normalize $d_{o,v}$, $d_{o,a}$ and $d_{o,p}$ to obtain $d^{*}_{o,v}$, $d^{*}_{o,a}$ and $d^{*}_{o,p}$, which describe the influence of vision, azimuth, and position, respectively, on the agent's decision-making (Equation 18). Intuitively, the more drastic the normalized action change, the more the agent depends on that modality. Since there are two agents, the modality influence values are computed separately for each agent. The experimental intervention results for the two agents are shown in Figure 11 (top for agent 0, bottom for agent 1).

Refer to caption
Figure 11: Dynamic vision, azimuth and position impact for two agents in an episode. Columns correspond to three sampled time steps. The orange, green and pink bars represent the importance of vision, azimuth and position, respectively.
d^{*}_{o,v}=d_{o,v}/(d_{o,v}+d_{o,a}+d_{o,p}),\quad d^{*}_{o,a}=d_{o,a}/(d_{o,v}+d_{o,a}+d_{o,p}),\quad d^{*}_{o,p}=d_{o,p}/(d_{o,v}+d_{o,a}+d_{o,p}), \qquad (18)
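To make the procedure concrete, the following is a minimal sketch, assuming a policy that maps an observation dict with 'vision', 'azimuth' and 'position' entries to a categorical action distribution; the interface is illustrative, not the released code:

import torch

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two categorical action distributions (probability vectors)."""
    return torch.sum(p * (torch.log(p + eps) - torch.log(q + eps)))

def action_modality_importance(policy, obs):
    """Return the normalized importance d*_{o,v}, d*_{o,a}, d*_{o,p} for one agent."""
    p_orig = policy(obs)                                  # original action distribution
    d = {}
    for m in ("vision", "azimuth", "position"):
        corrupted = dict(obs)
        corrupted[m] = torch.randn_like(obs[m])           # random-noise intervention on one modality
        d[m] = kl_divergence(p_orig, policy(corrupted)).item()
    total = sum(d.values()) + 1e-8
    return {m: v / total for m, v in d.items()}           # normalization of Equation 18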

A.7 On Dynamic Modality Importance for the RIR Measurement

We hypothesize that the roles of vision (v), azimuth (a), and position (p) in predicting the room impulse response may vary at any given time step. To test this hypothesis, we 1) replace one modality of the input (e.g., v) with random noise, 2) test both the intervened and the non-intervened input with the trained model for 1000 episodes, 3) record the intervened test result PE(v) and the test result PE without intervention, and 4) compute the change: δPEv\delta^{v}_{PE} denotes the absolute change in the test metric PE after the visual intervention (v). Following the same steps, we obtain δPEa\delta^{a}_{PE} (the absolute change in PE after the azimuth intervention) and δPEp\delta^{p}_{PE} (the absolute change in PE after the position intervention).

\delta^{v}_{PE}=|PE-PE(v)|,\quad \delta^{a}_{PE}=|PE-PE(a)|,\quad \delta^{p}_{PE}=|PE-PE(p)|,\quad \delta^{v\star}_{PE}=\delta^{v}_{PE}/(\delta^{v}_{PE}+\delta^{a}_{PE}+\delta^{p}_{PE}),\quad \delta^{a\star}_{PE}=\delta^{a}_{PE}/(\delta^{v}_{PE}+\delta^{a}_{PE}+\delta^{p}_{PE}),\quad \delta^{p\star}_{PE}=\delta^{p}_{PE}/(\delta^{v}_{PE}+\delta^{a}_{PE}+\delta^{p}_{PE}), \qquad (19)

Finally, we normalize δPEv\delta^{v}_{PE}, δPEa\delta^{a}_{PE}, and δPEp\delta^{p}_{PE} to obtain δPEv\delta^{v\star}_{PE}, δPEa\delta^{a\star}_{PE}, and δPEp\delta^{p\star}_{PE}, which characterize the impact of vision, azimuth and position, respectively, on the RIR prediction (Equation 19). Intuitively, the larger the value of δPEv\delta^{v\star}_{PE}, δPEa\delta^{a\star}_{PE}, or δPEp\delta^{p\star}_{PE}, the more the predicted RIR depends on that modality. Since there are two agents, the modality influence values are computed separately for each agent. The experimental intervention results for the two agents are shown in Figure 11 (top: agent 0; bottom: agent 1). We test on the Replica dataset under the settings αξ\alpha^{\xi}=1.0, αζ\alpha^{\zeta}=1.0, αψ\alpha^{\psi}=-1.0, αϕ\alpha^{\phi}=1.0, λ\lambda=0.1, κ\kappa=2, ρ\rho=-1.0. From Table 8, we conclude that the position input has the most significant impact on the prediction accuracy of the RIR; a small code sketch of the normalization follows the table.

Random intervention    agent 0 (%)                   agent 1 (%)
                       δPE     δWCR    δRTE          δPE     δWCR    δRTE
Vision                 0.14    0.38    0.19          0.11    0.33    0.41
Azimuth                16.53   19.28   47.92         33.64   28.24   9.80
Position               83.33   80.34   51.89         66.25   71.43   89.79
Table 8: Dynamic importance of each modality for the RIR measurement; each row reports the changes after replacing the named modality with random noise.
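A minimal sketch of this normalization (the PE values in the example are placeholders, not measured results):

def rir_modality_importance(pe_clean, pe_vision, pe_azimuth, pe_position):
    """Normalized deltas of Equation 19: relative impact of each modality on the PE metric."""
    deltas = {
        "vision": abs(pe_clean - pe_vision),
        "azimuth": abs(pe_clean - pe_azimuth),
        "position": abs(pe_clean - pe_position),
    }
    total = sum(deltas.values()) + 1e-8
    return {m: d / total for m, d in deltas.items()}

# Example with hypothetical PE values averaged over 1000 test episodes:
print(rir_modality_importance(pe_clean=3.25, pe_vision=3.30, pe_azimuth=3.85, pe_position=5.95))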

A.8 Visualization of Agents’ State Features

It is more challenging to visualize the disentanglement quality of the state representations stωs^{\omega}_{t} and stνs^{\nu}_{t}. We therefore perform dimensionality reduction (to two dimensions) and clustering with UMAP McInnes et al. (2018). The UMAP result is shown in Figure 12, color-coded by the action selected by the robot.

Refer to caption
Figure 12: UMAP of agent 0’s state features (left) and agent 1’s state features (right).

As shown in Figure 12, the learned state representations correlate naturally with the robot’s action selection.
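A minimal sketch of this projection, assuming the per-step state features and selected actions have been dumped to NumPy arrays (the file names are hypothetical) and the umap-learn package is installed:

import numpy as np
import umap
import matplotlib.pyplot as plt

states = np.load("agent0_states.npy")    # (N, 512) state features s_t of agent 0 (hypothetical dump)
actions = np.load("agent0_actions.npy")  # (N,) actions selected at the same steps (hypothetical dump)

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(states)
plt.scatter(embedding[:, 0], embedding[:, 1], c=actions, cmap="tab10", s=4)
plt.title("UMAP of agent 0 state features (color = selected action)")
plt.show()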

A.9 Algorithm Parameters

The parameters used in our model are specified in Table 9.

Parameter Replica(Matterport3D)
number of updates 40000(60000)
use linear learning rate decay False
use linear clip decay False
RIR sampling rate 16000
clip param 0.1
ppo epoch 4
num mini batch 1
value loss coef 0.5
entropy coef 0.02
learning rate 2×1042\times 10^{-4}
max grad norm 0.5
num steps 150
use GAE (Generalized Advantage Estimation) True
reward window size 50
window length 600
window type “hann window”
number of processes 5(10)
wmsew^{mse} 1.0
wmω,wmν,wm,wξw^{\omega}_{m},w^{\nu}_{m},w_{m},w_{\xi} 1/2\nicefrac{{1}}{{2}}
αξ,αζ,αϕ\alpha^{\xi},\alpha^{\zeta},\alpha^{\phi} 1.0
αψ\alpha^{\psi} -1.0
β\beta 0.01
γ\gamma 0.99
τ\tau 0.95
κ\kappa 1
λ\lambda 0.1
fft size 1024
shift size 120
hidden size 512
optimizer Adam
Table 9: Algorithm parameters.
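For reference, a hedged sketch of how the Replica settings from Table 9 might be gathered into a single training configuration (the key names are illustrative, not those of the released code):

config = dict(
    num_updates=40000,          # 60000 for Matterport3D
    rir_sampling_rate=16000,
    clip_param=0.1,
    ppo_epoch=4,
    num_mini_batch=1,
    value_loss_coef=0.5,
    entropy_coef=0.02,
    lr=2e-4,
    max_grad_norm=0.5,
    num_steps=150,
    use_gae=True,
    gamma=0.99,
    tau=0.95,
    hidden_size=512,
    num_processes=5,            # 10 for Matterport3D
    optimizer="Adam",
)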

Appendix B MACMARA: MACMA with Reward Assignment

We present the extension MACMARA (Measuring Acoustics with Collaborative Multiple Agents with a dynamic Reward Assignment) here. MACMARA extends MACMA with a learnable reward assignment module for the environmental reward.

B.1 The Overview of MACMARA

We also explored a model with reward assignment. This model has two cooperating agents moving in a 3D environment, using vision, position and azimuth to measure the RIR. The proposed model consists of four main parts: agent 0, agent 1, the RIR measurement, and the environmental reward assignment (see Figure 13). Given egocentric vision, azimuth, and position inputs, our model encodes these multi-modal cues to 1) determine the actions of the agents and evaluate the actions taken for policy optimization, 2) measure the room impulse response and evaluate the regression accuracy of the RIR generator, 3) evaluate the trade-off between the agents’ exploration and the RIR measurement, and 4) assign the environmental reward to each agent. The two agents repeat this process until the maximum number of steps is reached.

Since 1), 2), and 3) are introduced in Section 3 of the main paper, we mainly describe the environmental reward assignment module here.

B.2 Environmental Reward Assignment for MACMARA

Here we focus on a learnable reward assignment module for environmental reward in MACMARA.

Refer to caption
Figure 13: The MACMARA architecture: agent 0 and agent 1 first learn to encode observations as stωs^{\omega}_{t} and stνs^{\nu}_{t} using the encoders EωE_{\omega} and EνE_{\nu}, respectively; these states are fed to actor-critic networks to predict the next actions atωa_{t}^{\omega} and atνa_{t}^{\nu}. The reward assignment module decomposes the environmental reward rtr_{t} into agent 0’s component rtωr_{t}^{\omega} and agent 1’s component rtνr_{t}^{\nu}, supervised by rtr_{t} itself. The RIR measurement module learns to predict the room impulse response W^t\hat{W}_{t} guided by the ground truth WtW_{t}.

Environmental reward assignment. The environmental reward assignment is implemented by a multilayer perceptron. Given stωs^{\omega}_{t}, stνs^{\nu}_{t} and rtr_{t} at time step tt, this module outputs rtωr^{\omega}_{t} and rtνr^{\nu}_{t}. To ensure that this learning follows the desired direction, we add rtωr^{\omega}_{t} and rtνr^{\nu}_{t} to obtain r^t\hat{r}_{t}. The reward assignment is learned with rtr_{t} as the guidance signal: the optimization objective is to make r^t\hat{r}_{t} close to rtr_{t}. We define a regression loss σ\mathcal{L}^{\sigma} to perform this optimization:

\mathcal{L}^{\sigma}=\sum\left(r-r^{\omega}-r^{\nu}\right)^{2}, \qquad (20)

where rω{r}^{\omega} and rν{r}^{\nu} are, respectively, the predicted rewards (for the tt-th time step) of agent 0 and agent 1, and r{r} is the reward obtained from the environment. σ\mathcal{L}^{\sigma} is the regression loss of the reward assignment module. The rewards are distributed using a combination of trainable and fixed weights:

\rho^{\omega}=\frac{1-\rho}{2}+\rho_{R}^{\omega}\cdot\rho,\quad \rho^{\nu}=\frac{1-\rho}{2}+\rho_{R}^{\nu}\cdot\rho,\quad \mathcal{L}^{\rho}=\sum\left(1-\rho_{R}^{\omega}-\rho_{R}^{\nu}\right)^{2},\quad 0\leq\rho\leq 1, \qquad (21)

where ρ\rho is a constant weight parameter; ρRω\rho_{R}^{\omega} and ρRν\rho_{R}^{\nu} are reward weights predicted by the neural network; and ρω\rho^{\omega} and ρν\rho^{\nu} are the immediate reward weights. ρ\mathcal{L}^{\rho} is equivalent to σ\mathcal{L}^{\sigma} in Equation 20. The rewards rω{r}^{\omega} and rν{r}^{\nu} are then computed as rω=rρωr^{\omega}=r\cdot\rho^{\omega} and rν=rρνr^{\nu}=r\cdot\rho^{\nu}, respectively.
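A minimal PyTorch sketch of such a module under our own assumptions (the layer sizes, the sigmoid bounding of the trainable weights, and the class name are illustrative, not the released implementation):

import torch
import torch.nn as nn

class RewardAssignment(nn.Module):
    """Predicts per-agent reward weights and the constraint loss of Equation 21."""
    def __init__(self, state_dim=512, rho=0.1):
        super().__init__()
        self.rho = rho                                        # constant mixing weight, 0 <= rho <= 1
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.Sigmoid())                  # trainable weights rho_R in [0, 1]

    def forward(self, s_omega, s_nu, r):
        rho_r = self.net(torch.cat([s_omega, s_nu], dim=-1))  # (B, 2)
        w = (1 - self.rho) / 2 + self.rho * rho_r              # immediate reward weights rho^omega, rho^nu
        r_omega, r_nu = r * w[:, 0], r * w[:, 1]               # per-agent rewards
        loss = ((1 - rho_r[:, 0] - rho_r[:, 1]) ** 2).sum()    # L^rho of Equation 21
        return r_omega, r_nu, loss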

B.3 The Loss of MACMARA

The total loss of our model is formulated as Equation 22. We minimize \mathcal{L} following Proximal Policy Optimization (PPO) Schulman et al. (2017).

\mathcal{L}=w_{m}\cdot\mathcal{L}^{m}+w_{\xi}\cdot\mathcal{L}^{\xi}+w_{\sigma}\cdot\mathcal{L}^{\sigma}, \qquad (22)

where m\mathcal{L}^{m} is the motion loss component for the two agents, ξ\mathcal{L}^{\xi} is the room impulse response prediction loss component, σ\mathcal{L}^{\sigma} is the reward assignment loss component for the two agents, and wm,wσ,wξw_{m},\,w_{\sigma},\,w_{\xi} are hyperparameters. m\mathcal{L}^{m} is formulated as Equation 2, ξ\mathcal{L}^{\xi} as Equation 4, and σ\mathcal{L}^{\sigma} as Equation 20. We set wm=wξ=wσ=1/3w_{m}=w_{\xi}=w_{\sigma}=\nicefrac{{1}}{{3}}.
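A trivial sketch of this combination (the variable names are illustrative):

def total_loss(loss_motion, loss_rir, loss_reward_assignment,
               w_m=1/3, w_xi=1/3, w_sigma=1/3):
    """Equation 22: weighted sum of the motion, RIR-prediction and reward-assignment losses."""
    return w_m * loss_motion + w_xi * loss_rir + w_sigma * loss_reward_assignment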

B.4 Formulation of MACMARA

We denote the agent 0 and agent 1 with superscript ω\omega and ν\nu, respectively. The game =(𝒮,(𝒜ω,𝒜ν),𝒫,(ω,ν))\mathcal{M}=(\mathcal{S},(\mathcal{A}^{\omega},\mathcal{A}^{\nu}),\mathcal{P},(\mathcal{R}^{\omega},\mathcal{R}^{\nu})) consists of state set 𝒮\mathcal{S}, action sets 𝒜ω\mathcal{A}^{\omega}\, and 𝒜ν\mathcal{A}^{\nu}\,, and a joint state transition function 𝒫\mathcal{P}\,: 𝒮×𝒜ω×𝒜ν𝒮\mathcal{S}\times\mathcal{A}^{\omega}\times\mathcal{A}^{\nu}\rightarrow\mathcal{S}. The reward function ω:𝒮×𝒜ω×𝒜ν×𝒮\mathcal{R}^{\omega}\,:\mathcal{S}\times\mathcal{A}^{\omega}\times\mathcal{A}^{\nu}\times\mathcal{S}\rightarrow\mathbb{R}\, for agent 0 and ν:𝒮×𝒜ω×𝒜ν×𝒮\mathcal{R}^{\nu}:\mathcal{S}\times\mathcal{A}^{\omega}\times\mathcal{A}^{\nu}\times\mathcal{S}\rightarrow\mathbb{R}\, for agent 1 respectively depends on the current state, next state and both the agent 0’s and the agent 1’s actions. Each player wishes to maximize their discounted sum of rewards. rr is the reward given by the environment at every time step in an episode. MACMARA is modeled as a multi-agent Sunehag et al. (2018); Rashid et al. (2018) problem involving two collaborating players sharing the same goal:

\begin{aligned}
\min.\quad &\mathcal{L}\\
\text{s.t.}\quad &\pi^{\star}=(\pi^{\star,\,\omega},\pi^{\star,\,\nu})=\mathop{\arg\max}\limits_{\pi^{\omega}\in\Pi^{\omega},\,\pi^{\nu}\in\Pi^{\nu}}G(\pi^{\omega},\,\pi^{\nu},\,r)=\Big(\mathop{\arg\max}\limits_{\pi^{\omega}\in\Pi^{\omega}}G(\pi^{\omega},\,r),\ \mathop{\arg\max}\limits_{\pi^{\nu}\in\Pi^{\nu}}G(\pi^{\nu},\,r)\Big),\\
&G(\pi^{\omega},\pi^{\nu},r)=w^{\omega}G(\pi^{\omega},r)+w^{\nu}G(\pi^{\nu},r),\\
&G(\pi^{\omega},r)=\textstyle\sum_{t=0}\gamma^{t}r_{t}\rho^{\omega},\quad G(\pi^{\nu},r)=\textstyle\sum_{t=0}\gamma^{t}r_{t}\rho^{\nu},\\
&\rho^{\omega}=(1-\rho)/2+\rho\cdot\rho_{R}^{\omega},\quad \rho^{\nu}=(1-\rho)/2+\rho\cdot\rho_{R}^{\nu},\\
&\rho_{R}^{\omega}+\rho_{R}^{\nu}=1,\quad w^{\omega}>0,\quad w^{\nu}>0,\quad 0\leq\rho\leq 1,\\
&0\leq\rho_{R}^{\omega}\leq 1,\quad 0\leq\rho_{R}^{\nu}\leq 1,\\
&\min.\ \mathcal{L}^{\xi},\quad \min.\ \mathcal{L}^{\sigma},
\end{aligned} \qquad (23)

where \mathcal{L} is defined in Equation 22. G(πω,πν,r)G(\pi^{\omega},\,\pi^{\nu},\,r) is the expected joint reward of agent 0 and agent 1 as a whole. G(πω,r)G(\pi^{\omega},\,r) and G(πν,r)G(\pi^{\nu},\,r) are the discounted cumulative rewards of agent 0 and agent 1, respectively. wωw^{\omega} and wνw^{\nu} denote the constant balance factors of the cumulative rewards for agent 0 and agent 1, respectively. ρω\rho^{\omega} and ρν\rho^{\nu} are the immediate reward contributions of agent 0 and agent 1, respectively. ρRω\rho_{R}^{\omega} and ρRν\rho_{R}^{\nu} represent the trainable reward decomposition weights. ρ\rho is a constant (throughout training) reward allocation parameter. ξ\mathcal{L}^{\xi} is formulated as Equation 4, and σ\mathcal{L}^{\sigma} as Equation 20.
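As a small worked example of the per-agent returns in Equation 23 (a sketch under assumed values; the discount factor and reward stream are placeholders):

def discounted_returns(rewards, rho_omega, rho_nu, gamma=0.99):
    """G(pi^omega, r) and G(pi^nu, r): discounted sums of the weighted environment reward."""
    g_omega = sum((gamma ** t) * r * w for t, (r, w) in enumerate(zip(rewards, rho_omega)))
    g_nu = sum((gamma ** t) * r * w for t, (r, w) in enumerate(zip(rewards, rho_nu)))
    return g_omega, g_nu

# Constant reward of 1.0 for 5 steps, with rho = 0.1 and rho_R = (0.6, 0.4):
rho_o = (1 - 0.1) / 2 + 0.1 * 0.6   # 0.51
rho_n = (1 - 0.1) / 2 + 0.1 * 0.4   # 0.49
print(discounted_returns([1.0] * 5, [rho_o] * 5, [rho_n] * 5))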

B.5 Experimental Results of MACMARA

In the experiments below, we use two models, MACMA and its extension MACMARA (Measuring Acoustics with Collaborative Multiple Agents with a dynamic reward assignment module), to present the ablation on ρ\rho.

ρ\rho is the constant reward weight in Equation 21. Results are obtained on the Replica dataset under the experimental settings αξ\alpha^{\xi}=1.0, αζ\alpha^{\zeta}=1.0, αψ\alpha^{\psi}=-1.0, αϕ\alpha^{\phi}=1.0, κ\kappa=2, λ\lambda=0.1. As can be seen from Table 10, ρ\rho=-1.0 is the best choice.

ρ\rho                WCR (\uparrow)   PE (\downarrow)   CR (\uparrow)   RTE (\downarrow)   SiSDR (\uparrow)
-1.0{\heartsuit}     0.6977           3.2509            0.7669          13.8896            23.6501
0.0{\clubsuit}       0.6614           3.5727            0.7288          12.6987            23.1183
0.2                  0.5287           3.3398            0.5798          14.0822            23.6715
0.4                  0.5040           3.1860            0.5512          15.3278            23.6485
0.6                  0.5384           3.3099            0.5904          12.4184            23.2581
0.8                  0.4510           2.9755            0.4903          14.5194            23.8725
1.0                  0.5471           3.2316            0.5995          12.9246            23.8201

{\heartsuit}: Without the reward assignment module; each agent gets the full environment reward (rtω=rtν=rtr^{\omega}_{t}=r^{\nu}_{t}=r_{t}).

{\clubsuit}: With the reward assignment module; each agent gets half of the environment reward (rtω=rtν=0.5rtr^{\omega}_{t}=r^{\nu}_{t}=0.5r_{t}).

Table 10: Ablation on ρ\rho.

B.6 Algorithm Parameters in MACMARA

The parameters used in our MACMARA model are those specified in Table 9, except that wm=wξ=wσ=1/3w_{m}=w_{\xi}=w_{\sigma}=\nicefrac{{1}}{{3}}.

References

  • Cao et al. [2016] Chunxiao Cao, Zhong Ren, Carl Schissler, Dinesh Manocha, and Kun Zhou. Interactive sound propagation with bidirectional path tracing. ACM Trans. Graph., 35(6):180:1–180:11, 2016.
  • Chang et al. [2017] Angel X. Chang, Angela Dai, Thomas A. Funkhouser, Maciej Halber, Matthias Nießner, et al. Matterport3d: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision, pages 667–676, 2017.
  • Chen et al. [2020] Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman. Soundspaces: Audio-visual navigation in 3d environments. In ECCV, pages 17–36, 2020.
  • Chen et al. [2021a] Changan Chen, Ziad Al-Halah, and Kristen Grauman. Semantic audio-visual navigation. In CVPR, pages 15516–15525, 2021.
  • Chen et al. [2021b] Changan Chen, Sagnik Majumder, Ziad Al-Halah, Ruohan Gao, Santhosh Kumar Ramakrishnan, and Kristen Grauman. Learning to set waypoints for audio-visual navigation. In ICLR, 2021.
  • Chen et al. [2022] Changan Chen, Ruohan Gao, Paul Calamia, and Kristen Grauman. Visual acoustic matching. In CVPR, pages 18836–18846, 2022.
  • Christensen et al. [2020] Jesper Haahr Christensen, Sascha Hornauer, and Stella X. Yu. Batvision: Learning to see 3d spatial layout with two ears. In ICRA, pages 1581–1587, 2020.
  • Dean et al. [2020] Victoria Dean, Shubham Tulsiani, and Abhinav Gupta. See, hear, explore: Curiosity via audio-visual association. In NeurIPS, 2020.
  • Diaz-Guerra et al. [2021] David Diaz-Guerra, Antonio Miguel, and José Ramón Beltrán. gpurir: A python library for room impulse response simulation with GPU acceleration. Multim. Tools Appl., 80(4):5653–5671, 2021.
  • Foerster et al. [2018] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI, pages 2974–2982, 2018.
  • Gan et al. [2020] Chuang Gan, Yiwei Zhang, Jiajun Wu, Boqing Gong, and Joshua B. Tenenbaum. Look, listen, and act: Towards audio-visual embodied navigation. In ICRA, pages 9701–9707, 2020.
  • Gan et al. [2022] Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B Tenenbaum, Josh H McDermott, and Antonio Torralba. Finding fallen objects via asynchronous audio-visual integration. In CVPR, pages 10513–10523, 2022.
  • Gao et al. [2020] Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, and Kristen Grauman. Visualechoes: Spatial image representation learning through echolocation. In ECCV, pages 658–676, 2020.
  • Garg et al. [2021] Rishabh Garg, Ruohan Gao, and Kristen Grauman. Geometry-aware multi-task learning for binaural audio generation from video. In BMVC, page 1, 2021.
  • Jiang et al. [2022] Hao Jiang, Calvin Murdock, and Vamsi Krishna Ithapu. Egocentric deep multi-channel audio-visual active speaker localization. In CVPR, pages 10534–10542, 2022.
  • Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, pages 6379–6390, 2017.
  • Luo et al. [2022] Andrew Luo, Yilun Du, Michael Tarr, Josh Tenenbaum, Antonio Torralba, and Chuang Gan. Learning neural acoustic fields. CoRR, abs/2204.00628, 2022.
  • Majumder and Grauman [2022] Sagnik Majumder and Kristen Grauman. Active audio-visual separation of dynamic sound sources. In ECCV, pages 551–569, 2022.
  • Majumder et al. [2022] Sagnik Majumder, Changan Chen, Ziad Al-Halah, and Kristen Grauman. Few-shot audio-visual learning of environment acoustics. CoRR, abs/2206.04006, 2022.
  • Masztalski et al. [2020] Piotr Masztalski, Mateusz Matuszewski, Karol Piaskowski, and Michal Romaniuk. Storir: Stochastic room impulse response generation for audio data augmentation. In Interspeech, pages 2857–2861, 2020.
  • McInnes et al. [2018] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. UMAP: uniform manifold approximation and projection. J. Open Source Softw., 3(29):861, 2018.
  • Mehra et al. [2013] Ravish Mehra, Nikunj Raghuvanshi, Lakulish Antani, Anish Chandak, Sean Curtis, and Dinesh Manocha. Wave-based sound propagation in large open scenes using an equivalent source formulation. ACM Trans. Graph., 32(2):19:1–19:13, 2013.
  • Michelsanti et al. [2021] Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, and Jesper Jensen. An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE ACM Trans. Audio Speech Lang. Process., 29:1368–1396, 2021.
  • Mildenhall et al. [2022] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: representing scenes as neural radiance fields for view synthesis. Commun. ACM, 65(1):99–106, 2022.
  • Morgado et al. [2018] Pedro Morgado, Nuno Vasconcelos, Timothy R. Langlois, and Oliver Wang. Self-supervised generation of spatial audio for 360° video. In NeurIPS, pages 360–370, 2018.
  • Pathak et al. [2017] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, pages 2778–2787, 2017.
  • Purushwalkam et al. [2021] Senthil Purushwalkam, Sebastia Vicenc Amengual Gari, Vamsi Krishna Ithapu, Carl Schissler, Philip Robinson, Abhinav Gupta, and Kristen Grauman. Audio-visual floorplan reconstruction. In ICCV, pages 1163–1172, 2021.
  • Rachavarapu et al. [2021] Kranthi Kumar Rachavarapu, Aakanksha, Vignesh Sundaresha, and A. N. Rajagopalan. Localize to binauralize: Audio spatialization from visual sound source localization. In ICCV, pages 1910–1919, 2021.
  • Ramakrishnan et al. [2020] Santhosh K. Ramakrishnan, Ziad Al-Halah, and Kristen Grauman. Occupancy anticipation for efficient exploration and navigation. In ECCV, pages 400–418, 2020.
  • Rashid et al. [2018] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, pages 4292–4301, 2018.
  • Rashid et al. [2020] Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted QMIX: expanding monotonic value function factorisation for deep multi-agent reinforcement learning. In NeurIPS, 2020.
  • Ratnarajah et al. [2021] Anton Ratnarajah, Zhenyu Tang, and Dinesh Manocha. IR-GAN: room impulse response generator for far-field speech recognition. In Interspeech, pages 286–290, 2021.
  • Ratnarajah et al. [2022a] Anton Ratnarajah, Zhenyu Tang, Rohith Aralikatti, and Dinesh Manocha. MESH2IR: neural acoustic impulse response generator for complex 3d scenes. In MM ’22: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022, pages 924–933. ACM, 2022.
  • Ratnarajah et al. [2022b] Anton Ratnarajah, Shi-Xiong Zhang, Meng Yu, Zhenyu Tang, Dinesh Manocha, and Dong Yu. Fast-rir: Fast neural diffuse room impulse response generator. In ICASSP, pages 571–575, 2022.
  • Ruohan and Kristen [2019] Gao Ruohan and Grauman Kristen. 2.5 d visual sound. In CVPR, pages 324–333, 2019.
  • Savioja and Svensson [2015] Lauri Savioja and U Peter Svensson. Overview of geometrical room acoustic modeling techniques. The Journal of the Acoustical Society of America, 138(2):708–730, 2015.
  • Savva et al. [2019] Manolis Savva, Jitendra Malik, Devi Parikh, Dhruv Batra, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, and Vladlen Koltun. Habitat: A platform for embodied AI research. In ICCV, pages 9338–9346, 2019.
  • Schissler et al. [2014] Carl Schissler, Ravish Mehra, and Dinesh Manocha. High-order diffraction and diffuse reflections for interactive sound propagation in large environments. TOG, 33(4):1–12, 2014.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • Shao et al. [2022] Yiwen Shao, S. Z., and Dong Yu. Multi-channel multi-speaker ASR using 3d spatial feature. In ICASSP, pages 6067–6071, 2022.
  • Singh et al. [2021] Nikhil Singh, Jeff Mentch, Jerry Ng, Matthew Beveridge, and Iddo Drori. Image2reverb: Cross-modal reverb impulse response synthesis. In ICCV, pages 286–295, 2021.
  • Straub et al. [2019] Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces. CoRR, abs/1906.05797, 2019.
  • Sunehag et al. [2018] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In AAMAS, pages 2085–2087, 2018.
  • Tang et al. [2020] Zhenyu Tang, Lianwu Chen, Bo Wu, Dong Yu, and Dinesh Manocha. Improving reverberant speech training using diffuse acoustic simulation. In ICASSP, pages 6969–6973, 2020.
  • Tang et al. [2022] Zhenyu Tang, Rohith Aralikatti, Anton Jeran Ratnarajah, and Dinesh Manocha. GWA: A large high-quality acoustic dataset for audio processing. In SIGGRAPH, pages 36:1–36:9, 2022.
  • Taylor et al. [2012] Micah Taylor, Anish Chandak, Qi Mo, Christian Lauterbach, Carl Schissler, and Dinesh Manocha. Guided multiview ray tracing for fast auralization. IEEE Trans. Vis. Comput. Graph., 18(11):1797–1810, 2012.
  • Välimäki et al. [2016] Vesa Välimäki, Julian Parker, Lauri Savioja, Julius O Smith, and Jonathan Abel. More than 50 years of artificial reverberation. In Audio engineering society conference, 2016.
  • Wang et al. [2021] Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. DOP: off-policy multi-agent decomposed policy gradients. In ICLR, 2021.
  • Xu et al. [2021] Xudong Xu, Hang Zhou, Ziwei Liu, Bo Dai, Xiaogang Wang, and Dahua Lin. Visually informed binaural audio generation without binaural audios. In CVPR, pages 15485–15494, 2021.
  • Yu et al. [2022a] Yinfeng Yu, Lele Cao, Fuchun Sun, Xiaohong Liu, and Liejun Wang. Pay self-attention to audio-visual navigation. In BMVC, 2022.
  • Yu et al. [2022b] Yinfeng Yu, Wenbing Huang, Fuchun Sun, Changan Chen, Yikai Wang, and Xiaohong Liu. Sound adversarial audio-visual navigation. In ICLR, 2022.
  • Yu et al. [2023] Yinfeng Yu, Lele Cao, Fuchun Sun, Chao Yang, Huicheng Lai, and Wenbing Huang. Echo-enhanced embodied visual navigation. Neural Computation, 35(5):958–976, 2023.