^†^†thanks: ORCID: 0000-0002-4158-6218^†^†thanks: A joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer IIS, Germany.

Simulating room transfer functions between transducers mounted on audio devices using a modified image source method

Zeyu Xu zeyu.xu@audiolabs-erlangen.de International Audio Laboratories, Am Wolfsmantel 33, 91058, Erlangen, Germany. Adrian Herzog International Audio Laboratories, Am Wolfsmantel 33, 91058, Erlangen, Germany. Alexander Lodermeyer Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058, Erlangen, Germany. Emanuël A. P. Habets International Audio Laboratories, Am Wolfsmantel 33, 91058, Erlangen, Germany. Albert G. Prinn Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058, Erlangen, Germany.

Abstract

The image source method (ISM) is often used to simulate room acoustics due to its ease of use and computational efficiency. The standard ISM is limited to simulations of room impulse responses between point sources and omnidirectional receivers. In this work, the ISM is extended using spherical harmonic directivity coefficients to include acoustic diffraction effects due to source and receiver transducers mounted on physical devices, which are typically encountered in practical situations. The proposed method is verified using finite element simulations of various loudspeaker and microphone configurations in a rectangular room. It is shown that the accuracy of the proposed method is related to the sizes, shapes, number, and positions of the devices inside a room. A simplified version of the proposed method, which can significantly reduce computational effort, is also presented. The proposed method and its simplified version can simulate room transfer functions more accurately than currently available image source methods and can aid the development and evaluation of speech and acoustic signal processing algorithms, including speech enhancement, acoustic scene analysis, and acoustic parameter estimation.

I Introduction

Room acoustic simulation has been an active research field since the 1960s [1]. It is of paramount importance for, e.g., auralization and virtual acoustics [2, 3, 4, 5, 6, 7, 8], evaluation of acoustic signal processing algorithms [9, 10, 11, 12, 13], and data generation for data-driven methods [14, 15, 16, 17, 18, 19, 20, 21]. Room acoustics simulation methods can be broadly categorized into geometric methods [22, 23, 24, 25, 26], wave-based methods [27, 28, 29, 30, 31, 32], hybrid methods [33, 34], and, more recently, deep-learning-based methods [8, 35, 36]. This paper presents an extension to the well-known geometrical method, the image source method (ISM), which enables the inclusion of acoustic diffraction effects from audio devices of finite extent in shoebox-shaped rooms.

Modern smart speakers are a typical embodiment of an audio device that contains loudspeakers and microphones. Generating a reverberant room acoustic simulation environment which includes the directivities of loudspeakers and microphones due to the transducer properties and the acoustic diffraction effects, is essential for developing and testing new acoustic signal processing algorithms. When using wave-based methods, the acoustic diffraction effects caused by the loudspeaker enclosure are implicitly modeled. However, because of the much higher computational cost of solving wave-based models, geometrical methods are often preferred. While assuming that sound waves behave like light rays does not apply to long wavelengths, Aretz et al. [37] have shown that the ISM can be used to approximate wave-based method solutions above the Schroeder frequency [38] in rectangular rooms by specifying angle-dependent reflection coefficients. Thus, the ISM can be used to obtain acceptable solutions across most of the audible frequency range above the Schroeder frequency.

The ISM has been implemented in many widely used toolboxes [39, 40, 41, 42], along with various approaches for reducing the computational complexity, for example, parallelization based on GPUs [43, 42]. Directional properties of the source and receiver were also incorporated into the ISM to produce more realistic room impulse responses [44, 39, 41, 45, 40, 46], and efforts have been made to include acoustic diffraction effects due to the human head [44] and rigid spherical baffles [47, 48]. Jarrett et al. [47] extended the ISM to include rigid spherical microphones using an analytical expression of the diffraction effects in the spherical harmonic domain. Samarasinghe et al. [49] further extended the ISM, viz. the Generalized ISM (GISM), to facilitate parameterization of the reverberant transfer function between a directional source and a directional receiver. More recently, an extension of the ISM to incorporate loudspeaker directivities in the far field in the spherical harmonic domain has been reported [50]. However, the existing approaches cannot simulate a transfer function between a directive source and a directive receiver that includes near-field diffraction effects.

In this work, the GISM [49] is extended to enable the incorporation of acoustic diffraction effects. First, we review the basic concepts of the ISM and its extension to the spherical harmonic domain in Sec. II. In Sec. III, we present our proposed room transfer function (RTF) parameterization, which utilizes the source directivity coefficients computed from solutions of an exterior radiation problem and receiver directivity coefficients obtained using reciprocity [51]. In addition, we derive a simplified version of the proposed method to reduce computational complexity. The Finite Element Method (FEM) is used to verify the proposed method by comparing RTFs with the source and receiver mounted on either different devices or on the same device. The experimental setups and the evaluation of the proposed method are presented in Sec. IV. The effect of varying the parameters of the device, e.g., its position in the room, physical shape, and size, on the accuracy of the proposed method is investigated. Finally, conclusions are drawn in Sec. V.

II The Image Source Method

II.1 Standard image source method

Table 1: Parameters and descriptions used in ISM

Parameter	Descriptions
$L_{x},L_{y},L_{z}$	Room dimensions: length, width and height
$\beta_{x1},\beta_{x2},\beta_{y1},\beta_{y2},\beta_{z1},\beta_{z2}$	Reflection coefficients of the walls
$\mathbf{p}=(p_{x},p_{y},p_{z})\in\mathcal{P}$	Triple determining whether the image source is mirrored
	w.r.t. walls at $x=0$ , $y=0$ or $z=0$ , elements take values of $0$ or $1$
$\mathbf{q}=(q_{x},q_{y},q_{z})\in\mathcal{Q}$	Triple determining higher order reflections
	each element can take values between $[-N_{m},N_{m}]$
$\mathbf{x}^{\text{sI}}_{\mathbf{p},\mathbf{q}}$	Source images
$\mathbf{x}^{\text{rI}}_{\mathbf{p},\mathbf{q}}$	Receiver images
$\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}=\mathbf{x}_{\text{r}}-\mathbf{x}^{\text{sI}}_{\mathbf{p},\mathbf{q}}$	Vector pointing from source images to the receiver
$\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}=-\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$	Vector pointing from receiver to source images
$\mathbf{R}^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}}$	The reversed vector corresponding to the same reflection path of $\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}$

The ISM was first applied to acoustics by Carslaw [52] and later implemented using a digital computer by Allen and Berkley [22] to model the impulse response between a point source and an omnidirectional receiver in a reverberant rectangular room.

To derive the standard ISM, consider a rectangular room with length, width, and height given by $L_{x},L_{y}$ and $L_{z}$ , as well as an acoustic point source and an omidirectional receiver inside of the room at positions $\mathbf{x}_{\text{s}}=(x_{\text{s}},y_{\text{s}},z_{\text{s}})$ and $\mathbf{x}_{\text{r}}=(x_{\text{r}},y_{\text{r}},z_{\text{r}})$ , given in Cartesian coordinates, respectively. The RTF can be expressed as a weighted sum of free-space Green’s functions attenuated by the walls, that is:

\displaystyle H_{\text{ISM}}(\mathbf{x}_{\text{r}}|\mathbf{x}_{\text{s}},k)=\sum_{\mathbf{p}\in\mathcal{P}}\sum_{\mathbf{q}\in\mathcal{Q}}\beta(\mathbf{p},\mathbf{q})G(\mathbf{x}_{\text{r}}|\mathbf{x}^{\text{sI}}_{\mathbf{p},\mathbf{q}},k)~{},

(1)

where $k$ is the wavenumber. The triples $\mathbf{p}$ and $\mathbf{q}$ determine the position of the images and the attenuation factor $\beta(\mathbf{p},\mathbf{q})$ . The triple $\mathbf{p}=(p_{x},p_{y},p_{z})$ is used to determine whether the image source is mirrored w.r.t. walls at $x=0$ , $y=0$ or $z=0$ , and its elements take values of $0$ or $1$ , resulting in a set $\mathcal{P}$ of eight combinations. Each element of the other triple $\mathbf{q}=(q_{x},q_{y},q_{z})$ takes a value between $[-N_{m},N_{m}]$ , and it includes higher order reflections, where $N_{m}$ is used to limit computational complexity and circular convolution errors [47], resulting in a set $\mathcal{Q}$ of $(2N_{m}+1)^{3}$ combinations. $\beta(\mathbf{p},\mathbf{q})=\beta_{x_{1}}^{|q_{x}-p_{x}|}\beta_{x_{2}}^{|q_{x}|}\beta_{y_{1}}^{|q_{y}-p_{y}|}\beta_{y_{2}}^{|q_{y}|}\beta_{z_{1}}^{|q_{z}-p_{z}|}\beta_{z_{2}}^{|q_{z}|}$ and $\beta_{x_{1}},\beta_{x_{2}},\beta_{y_{1}},\beta_{y_{2}},\beta_{z_{1}},\beta_{z_{2}}$ are the angle-independent reflection coefficients of the walls, and the subscripts $a_{1},a_{2}$ ( $a\in\{x,y,z\}$ ) of $\beta_{(\cdot)}$ correspond to the boundaries at $a=0$ and the boundaries at $a=L_{a}$ , respectively. For an image source with position

\mathbf{x}^{\text{sI}}_{\mathbf{p},\mathbf{q}}=\begin{bmatrix}x_{\text{s}}-2p_{x}x_{\text{s}}+2q_{x}L_{x}\\ y_{\text{s}}-2p_{y}y_{\text{s}}+2q_{y}L_{y}\\ z_{\text{s}}-2p_{z}z_{\text{s}}+2q_{z}L_{z}\end{bmatrix}~{},

(2)

the Green’s function can be written as

G(\mathbf{x}_{\text{r}}|\mathbf{x}^{\text{sI}}_{\mathbf{p},\mathbf{q}},k)=\frac{e^{-i\,k||\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}||}}{4\pi||\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}||}~{},

(3)

where $i$ denotes the imaginary unit. The vector pointing from every image of the source to the receiver is expressed as

\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}=\mathbf{x}_{\text{r}}-\mathbf{x}^{\text{sI}}_{\mathbf{p},\mathbf{q}}=\begin{bmatrix}x_{\text{r}}-x_{\text{s}}+2p_{x}x_{\text{s}}-2q_{x}L_{x}\\ y_{\text{r}}-y_{\text{s}}+2p_{y}y_{\text{s}}-2q_{y}L_{y}\\ z_{\text{r}}-z_{\text{s}}+2p_{z}z_{\text{s}}-2q_{z}L_{z}\end{bmatrix}~{}.

(4)

II.2 Extensions of the standard ISM

Many extensions to the ISM have been proposed to facilitate more realistic modeling of room acoustics. In this section, we discuss two important extensions of the ISM relevant to the present study, namely, using angle-dependent reflection coefficients and including source and receiver directivities.

II.2.1 Angle-dependent reflection coefficients

In Eq. (1), angle-independent reflection coefficients are used. Aretz et al. [37] have shown that including the angle of incidence when specifying the boundary conditions in the ISM can result in solutions that agree with FEM solutions. As such, angle-dependent reflection coefficients are used in this work.

For each reflection path represented via the indices $(\mathbf{p},\mathbf{q})$ in the ISM, three incident angles in Cartesian coordinates are needed, i.e.,

\theta_{a}(\mathbf{p},\mathbf{q})=\arccos\bigg{(}\frac{a_{\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}}}{\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|}\bigg{)}~{},

(5)

where $a\in\{x,y,z\}$ , $a_{\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}}$ is the corresponding coordinate of the vector $\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ and $\theta_{a}(\mathbf{p},\mathbf{q})$ is the incident angle on the walls perpendicular to the $a-$ axis. The incident angle $\theta_{a}(\mathbf{p},\mathbf{q})$ is the same for the reflections at the walls perpendicular to the same axis in a rectangular room [37]. For a uniform normalized impedance, $\zeta$ , the angle-dependent reflection coefficients are given by [53]

\displaystyle\beta_{a_{2}}=

\displaystyle~{}\beta_{a_{1}}=\frac{\zeta\cos(\theta_{a}(\mathbf{p},\mathbf{q}))-1}{\zeta\cos(\theta_{a}(\mathbf{p},\mathbf{q}))+1}~{}.

(6)

From this point onward, the attenuation term due to the reflections, $\beta(\mathbf{p},\mathbf{q})$ , contains the angle-dependent reflection coefficients defined in Eq. (6). Note that, while the reflection properties of surfaces are generally frequency dependent, we use a frequency-independent impedance $\zeta$ for all walls in this work without loss of generality.

II.2.2 Extension to the spherical harmonic domain

In practice, the sound source and receiver in a room usually exhibit directional properties that are far more complex than those of a point source or an omnidirectional receiver. Various methods for incorporating source and receiver directivities into the ISM have been presented in the literature, e.g., [44, 39, 41, 45, 48, 40, 46, 15, 18]. In this section, we present the Spherical Microphone array Impulse Response (SMIR) generator proposed by Jarrett et al. [54].

The directivity of a source or receiver depends not only on the directional properties of the transducer itself but also on acoustic wave interaction with the physical object on which the transducer is mounted. The SMIR generator extends the ISM to the spherical harmonic domain, which facilitates the incorporation of the acoustic diffraction effect due to a rigid sphere and enables the simulation of RTFs for microphones mounted on such a sphere. When considering the receiver at $\mathbf{x}_{\text{r}}$ as the center of the spherical array with an omnidirectional microphone positioned at $\mathbf{y}^{({\text{r}})}$ , the SMIR generator computes the RTF via

\displaystyle H_{\text{SMIR}}(\mathbf{y}^{({\text{r}})}|\mathbf{x}_{\text{s}},k)=\sum_{\mathbf{p}\in\mathcal{P}}\sum_{\mathbf{q}\in\mathcal{Q}}\beta(\mathbf{p},\mathbf{q})G_{\text{N}}(\mathbf{y}^{({\text{r}})}|\widetilde{\mathbf{R}}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}},k)~{},

(7)

where $G_{N}(\cdot|\cdot)$ denotes the Green’s function fulfilling Neumann boundary conditions on the sphere and $\widetilde{\mathbf{R}}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}$ is the vector pointing from the center of the array to the source image in spherical coordinates. Detailed expressions of $H_{\text{SMIR}}(\mathbf{y}^{({\text{r}})}|\mathbf{x}_{\text{s}},k)$ can be found in the literature [54]. It is worth noting that the Green’s function given in Eq. (3) and the Neumann Green’s function in Eq. (7) use different coordinate systems for the input arguments.

The acoustic diffraction on the sphere affects the directivity of the microphones on the sphere. Further directional properties of the source and receiver are usually implemented as a gain factor for each reflection path in the ISM. For example, Eq. (7) is modified by considering the source directivity [48] yielding the expression:

	$\displaystyle H_{\text{SMIR+}}(\mathbf{y}^{({\text{r}})}\|\mathbf{x}_{\text{s}},k)=$	$\displaystyle\sum_{\mathbf{p}\in\mathcal{P}}\sum_{\mathbf{q}\in\mathcal{Q}}\beta(\mathbf{p},\mathbf{q})g(\theta^{e,o}_{\mathbf{p},\mathbf{q}},\phi^{e,o}_{\mathbf{p},\mathbf{q}},k)$
		$\displaystyle\times G_{\text{N}}(\mathbf{y}^{({\text{r}})}\|\tilde{\mathbf{R}}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}},k)~{},$		(8)

where $\theta^{e,o}_{\mathbf{p},\mathbf{q}},\phi^{e,o}_{\mathbf{p},\mathbf{q}}$ are the emission inclination and azimuth angles of the image source with indexes $(\mathbf{p},\mathbf{q})$ with respect to the orientation of the source directivity pattern. The angles $\theta^{e,o}_{\mathbf{p},\mathbf{q}},\phi^{e,o}_{\mathbf{p},\mathbf{q}}$ can be obtained by calculating the angles between the source orientation, e.g., the main direction of the directivity pattern at the cell with indexes $(\mathbf{p},\mathbf{q})$ , and the vector from the source image with indexes $(\mathbf{p},\mathbf{q})$ to the receiver. For first-order directivity patterns, $g(\cdot)$ , can be expressed as [48]

g(\theta^{e,o}_{\mathbf{p},\mathbf{q}},\phi^{e,o}_{\mathbf{p},\mathbf{q}},k)=\delta+(1-\delta)\cos{\theta^{e,o}_{\mathbf{p},\mathbf{q}}}\sin{\phi^{e,o}_{\mathbf{p},\mathbf{q}}}~{},

(9)

where $0\leq\delta\leq 1$ . The relation between the emission and impinging angles, i.e., the mirrored emission angles after all reflections, depends on the number of reflections by walls perpendicular to different axes. A detailed formulation of this relation is provided by Hafezi et al. [48]. However, the phase information of the source directivity is missing in this approach.

II.3 Generalized image source method

\fig

Figure1.eps0.5

Figure 1: Illustration of the scenario modeled by Samarasinghe et al. [49]. A directional source is positioned inside a rectangular room. The room transfer function from the source located at

\mathbf{x}_{\text{s}}

to an observation point

\mathbf{y}^{(\text{r})}

around the receiver located at

\mathbf{x}_{\text{r}}

can be simulated.

The method introduced in Eq. (8) to model directional sources and receivers is restricted to directivity patterns with known analytical expressions. In practice, directivity patterns may exhibit complex shapes without analytical expressions. Samarasinghe et al. [49] further extended the ISM to facilitate the parameterization of the transfer function between a source with arbitrary directivity and a receiver region. The directional properties of the receiver region can be formed using beamforming techniques based on the received sound field. In this section, we describe the GISM, which is the foundation of our proposed method.

The GISM decomposes the transfer function into three parts: i) an outgoing sound field from the directional source, ii) an incident sound field captured by the receiver that can be directional, and iii) the mode coupling coefficients that couple the source and receiver directivities in the spherical harmonic domain.

Consider an outgoing sound field from a directional source, for example a loudspeaker, located at $\mathbf{x}_{\text{s}}$ as shown in Fig. 1. The sound field observed at an arbitrary point $\mathbf{r}$ outside of the sphere enclosing the entire speaker can be computed as [55]:

P(\mathbf{r},k)=\sum_{n=0}^{\infty}\sum_{m=-n}^{n}C^{(\text{s})}_{n,m}(k)h_{n}(kd)Y_{n,m}(\theta,\phi)~{},

(10)

where $(d,\theta,\phi)$ denote the spherical coordinates associated with $\mathbf{r}$ , viz. the distance $d$ , the inclination angle $\theta$ and the azimuth angle $\phi$ . Moreover, $h_{n}(\cdot)$ denotes the spherical Hankel function of order $n$ , $Y_{n,m}$ denotes the spherical harmonic function of order $n$ and mode $m$ , and $C^{(\text{s})}_{n,m}(k)$ represent the spherical harmonic source directivity coefficients. In practice, the infinite summation in Eq.(10) is truncated to a maximum order $N=\left\lceil kr_{0}^{(\text{s})}\right\rceil$ where $r_{0}^{(\text{s})}$ denotes the radius of the sphere that is large enough to surround the loudspeaker [56, 57].

The room transfer function from the source $\mathbf{x}_{\text{s}}$ to any point $\mathbf{y}^{(\text{r})}$ with spherical coordinates $(d_{y}^{({\text{r}})},\theta_{y}^{({\text{r}})},\phi_{y}^{({\text{r}})})$ relative to $\mathbf{x}_{\text{r}}$ can be expressed using the GISM as [49]:

	$\displaystyle H_{\text{GISM}}(\mathbf{y}^{({\text{r}})}\|\mathbf{x}_{\text{s}},k)=$	$\displaystyle\sum_{v=0}^{V}\sum_{u=-v}^{v}\sum_{n=0}^{N}\sum_{m=-n}^{n}C^{(\text{s})}_{n,m}(k)\gamma_{v,u}^{n,m}(k)$
		$\displaystyle\times j_{v}(kd_{y}^{(\text{r})})Y_{v,u}(\theta_{y}^{(\text{r})},\phi_{y}^{(\text{r})})~{},$		(11)

where $V=\left\lceil k\,d_{y}^{(r)}\right\rceil$ denotes the truncation limit that produces a small error [58] and $j_{v}(\cdot)$ denotes the $n$ th-order spherical Bessel function of the first kind. The reverberant mode coupling coefficients $\gamma_{v,u}^{n,m}(k)$ are given by

	$\displaystyle\gamma_{v,u}^{n,m}(k)=$	$\displaystyle\sum_{\mathbf{p}\in\mathcal{P}}\sum_{\mathbf{q}\in\mathcal{Q}}\beta(\mathbf{p},\mathbf{q})(-1)^{(p_{y}+p_{z})m+p_{z}n}$
		$\displaystyle\times\alpha_{v,u}^{n,(-1)^{p_{x}+p_{y}}m}(\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})~{},$		(12)

where $\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ has spherical coordinates $(||\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}||,\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})$ , and $\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}=(\theta^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}},\phi^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})$ . The additional factors $(-1)^{(p_{y}+p_{z})m+p_{z}n}$ and $(-1)^{p_{x}+p_{y}}$ result from the mirror effect of the image sources [59] on the spherical harmonics, which is a more general description compared to the use of emission and impinging angles as applied in the SMIR+ method [48]. Using $\mathbf{x}_{\mathbf{0}}$ with spherical coordinates $(d_{x_{0}},\theta_{x_{0}},\phi_{x_{0}})$ instead of $\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ for brevity, the single-path mode-coupling coefficients $\alpha_{v,u}^{n,m}(\cdot)$ in Eq.(12) can be derived using the translation of fields [59],

	$\displaystyle\alpha_{v,u}^{n,m}(\mathbf{x}_{\mathbf{o}})=$	$\displaystyle 4\pi i^{v-n}(-1)^{m}\displaystyle\sum_{l=\|n-v\|}^{n+v}i^{l}h^{(2)}_{l}(kd_{x_{0}})$
		$\displaystyle\times Y_{l,m-u}(\theta_{x_{0}},\phi_{x_{0}})W_{1}W_{2}\xi~{},$		(13)

with $W_{1}$ and $W_{2}$ denoting Wigner 3 - j symbols,

\displaystyle W_{1}=\begin{pmatrix}n&v&l\\ 0&0&0\end{pmatrix}~{},~{}W_{2}=\begin{pmatrix}n&v&l\\ -m&u&m-u\end{pmatrix}~{},

(14)

and $\xi=\sqrt{(2n+1)(2v+1)(2l+1)/4\pi}$ .

II.4 Fast source room receiver method

The Fast Source Room Receiver (FSRR) method, proposed by Luo and Kim [50], also simulates the RTF in the spherical harmonic domain. The RTF parameterization includes an approximation of the spherical harmonic description of the source and receiver directivities and utilizes a simplification of the cross-modal coupling, i.e., the mode coupling coefficients in Eq. (12). The method also uses a randomized sign parameter. Notably, the FSRR method uses reversed reflection paths - a technique explored in this work for deriving a simplified version of the proposed method, which is presented in Sec. III.4.

Using our notation, the FSRR method RTF is given by

$\displaystyle H_{\text{FSRR}}$	$\displaystyle(\mathbf{x}_{\text{r}}\|\mathbf{x}_{\text{s}},k)=\sum_{\mathbf{p}\in\mathcal{P}}\sum_{\mathbf{q}\in\mathcal{Q}}B_{\mathbf{p},\mathbf{q}}M_{\mathbf{p},\mathbf{q}}\frac{e^{-ik\|\|\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}\|\|}}{\|\|\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}\|\|}$
	$\displaystyle\times\sum_{n=0}^{N}\sum_{m=-n}^{n}\tilde{C}^{(\text{s})}_{n,m}(k)Y_{n,m}(\Omega^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}})$
	$\displaystyle\times\sum_{v=0}^{V}\sum_{u=-v}^{v}\tilde{C}^{(\text{r})}_{v,u}(k)Y_{v,u}(\Omega^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}})~{},$	(15)

where $B_{\mathbf{p},\mathbf{q}}$ randomly takes value between $1$ and $-1$ , $M_{\mathbf{p},\mathbf{q}}$ is a function fitted to frequency-dependent absorption coefficients, $\tilde{C}$ is a directivity coefficient obtained at 1 m from either the source or the receiver. The modified triples $\tilde{\mathbf{p}},\tilde{\mathbf{q}}$ are defined in Sec. III.4 (Eq. (30)), $\Omega^{\text{sI}\to\text{r}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}}$ is the direction of a vector pointing from a source image to the receiver, and $\Omega^{\text{r}\to\text{sI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}}$ is the direction of the vector corresponding to the reversed path of the same reflection. Definitions of the two vectors are given in Sec. III.4.

III Proposed method

We have seen that the SMIR generator relies on an analytical descriptions of directivity, the GISM simulates either a directive source or a directive receiver, and the FSRR method approximates the RTF between directional sources and receivers. In this section, we propose a method for simulating RTFs that can include near-field diffraction effects of both directional sources and directional receivers.

III.1 Problem formulation

\fig

Figure2.eps0.5

Figure 2: Illustration of the scenario considered for the proposed method. Two speakers are positioned inside a rectangular room. The source speaker has a vibrating piston on one of its surfaces located at

\mathbf{x}_{\text{s}}

. The receiver speaker has a microphone on the top surface located at

\mathbf{x}_{\text{r}}

. The room transfer function from

\mathbf{x}_{\text{s}}

\mathbf{x}_{\text{r}}

is of interest. Not shown here is the room in which the speakers are placed. The two spherical regions marked in light gray are nonoverlapping.

We consider the following scenario: one source device and one receiver device are positioned in a rectangular room, as depicted in Fig. 2. The source contains a sound-radiating transducer located at $\mathbf{x}_{\text{s}}$ , e.g., a vibrating piston on one surface, and a microphone located at $\mathbf{x}_{\text{r}}$ is mounted on the receiver. The two devices can have arbitrary shapes and materials, and the transducers can exhibit arbitrary directivity patterns. The acoustic field generated by the source transducer interacts with both devices and the walls and is captured by the microphone. We aim to incorporate the local diffraction effects around the devices into the RTFs based on the ISM. The existing methods are unable to resolve the problem under consideration due to the following reasons:

•

As shown by Xu et al.[51], the terms $j_{v}(kd_{y}^{(\text{r})})Y_{v,u}(\theta_{y}^{(\text{r})},\phi_{y}^{(\text{r})})$ corresponding to the incident sound field in Eq. (11) cannot model the diffraction around the receiver. Therefore, even utilizing the acoustic reciprocity principle, the GISM can only model the RTFs when either the source or the receiver does not include local diffraction effects.
•

The FSRR method [50] extends the ISM to include the directivity of both the source and receiver using a far-field approximation and, therefore, cannot model near-field directivities.

III.2 Source and receiver directivities

The directivity of a transducer mounted on a device can either be numerically simulated or practically measured at discrete points on a sphere around a local origin. When sound waves are generated or captured by the mounted transducers, e.g., a circular piston at $\mathbf{x}_{\text{s}}$ or a microphone at $\mathbf{x}_{\text{r}}$ as shown in Fig. 2, acoustic diffraction occurs due to the wave interaction with the speakers. As shown in previous works [55, 60, 51], the total sound radiation of a source device can be quantified using directivity coefficients defined on a transparent sphere with radius $r_{0}^{(\text{s})}$ . Using reciprocity, the receiver directivity can be obtained analogously to the source directivity [51]. As an example, consider the source device. If the sound field is sampled at $J$ discrete points on the transparent sphere, one can reformulate Eq. (10) with truncated order $N$ as

P(\mathbf{r}_{j},k)=\sum_{n=0}^{N}\sum_{m=-n}^{n}P_{n,m}(k,r_{0}^{(\text{s})})Y_{n,m}(\theta_{j},\phi_{j})~{},

(16)

where $\mathbf{r}_{j}$ denotes the the $j$ th position on the transparent sphere, $(\theta_{j},\phi_{j})$ the corresponding direction, $P_{n,m}(r_{0}^{(\text{s})},k)=C^{(\text{s})}_{n,m}(k)h_{n}(kr_{0}^{(\text{s})})$ is the spherical wave spectrum [55] and $j\in\{1,2,\ldots,J\}$ . Different choices of the distribution of $(\theta_{j},\phi_{j})$ can be found in the literature [59]. If the number of samples is larger than $(N+1)^{2}$ , the spherical wave spectrum can be calculated from the sampled sound field by solving an over-determined linear system using a least-squares approach. The directivity coefficients of the source can then be determined by $C^{(\text{s})}_{n,m}(k)=P_{n,m}(r_{0}^{(\text{s})},k)/h_{n}(k\,r_{0}^{(\text{s})})$ . The receiver directivity coefficients $C^{(\text{r})}_{n,m}(k)$ can be obtained following the same procedure by making use of reciprocity [51].

III.3 Room transfer function

The following parameterization of the transfer function between two transducers with arbitrary directivities in the free field, $H^{(\text{F})}$ , was proposed by Xu et al. [51]:

	$\displaystyle H^{(\text{F})}(\mathbf{x}_{\text{r}}\|\mathbf{x}_{\text{s}},k)=$	$\displaystyle\sum_{v=0}^{V}\sum_{u=-v}^{v}\sum_{n=0}^{N}\sum_{m=-n}^{n}C^{(\text{s})}_{n,m}(k)$
		$\displaystyle\times\alpha_{v,u}^{n,m}(k)D^{(\text{r})}_{v,u}(k)~{},$		(17)

where $\alpha_{v,u}^{n,m}(k)$ represent the free-field mode coupling coefficients. Note that the two transparent spheres shown in Fig. 2 are nonoverlapping in this work. However, when both the source and receiver are on the same device, one can measure or simulate the direct path to avoid the overlapping of the two transparent spheres. The following relation between $D^{(\text{r})}_{v,u}(k)$ and the directivity coefficients $C^{(\text{r})}_{v,u}(k)$ obtained via reciprocity for the receiver was derived in [51]

D_{v,u}^{(\text{r})}(k)=\frac{i(-1)^{u}}{k}C_{v,-u}^{(\text{r})}(k)~{}.

(18)

Since the summation over source images is a linear operation and the mirroring effects are already included in $\gamma_{v,u}^{n,m}$ in Eq. (12), the total RTF can be written by substituting $\gamma_{v,u}^{n,m}(k)$ for $\alpha_{v,u}^{n,m}(k)$ in Eq. (17), yielding:

	$\displaystyle H_{\text{DEISM}}(\mathbf{x}_{\text{r}}\|\mathbf{x}_{\text{s}},k)=$	$\displaystyle\sum_{v=0}^{V}\sum_{u=-v}^{v}\sum_{n=0}^{N}\sum_{m=-n}^{n}C^{(\text{s})}_{n,m}(k)$
		$\displaystyle\times\gamma_{v,u}^{n,m}(k)D^{(\text{r})}_{v,u}(k)~{}.$		(19)

The modified reverberant mode coupling coefficients can be defined, based on Eq. (12) and the reciprocal relation in Eq. (18), as follows:

\tilde{\alpha}_{v,u}^{n,m}(k)=\frac{i(-1)^{u}}{k}\gamma_{v,u}^{n,m}(k)~{}.

(20)

The final proposed RTF is therefore expressed as

	$\displaystyle H_{\text{DEISM}}(\mathbf{x}_{\text{r}}\|\mathbf{x}_{\text{s}},k)=$	$\displaystyle\sum_{v=0}^{V}\sum_{u=-v}^{v}\sum_{n=0}^{N}\sum_{m=-n}^{n}C^{(\text{s})}_{n,m}(k)$
		$\displaystyle\times\tilde{\alpha}_{v,u}^{n,m}(k)C^{(\text{r})}_{v,-u}(k)~{},$		(21)

where, for clarity, the proposed method is identified using the acronym Diffraction Enhanced Image Source Method (DEISM).

The GISM RTF, i.e., the simulation of a directional source to a local receiving region, can be recovered from the proposed RTF as follows. Similarly to Xu et al. [51], for a receiver region without any physical objects and an observation point $\mathbf{y}^{(\text{r})}$ with spherical coordinates $(d_{y}^{(\text{r})},\theta_{y}^{(\text{r})},\phi_{y}^{(\text{r})})$ relative to $\mathbf{x}_{\text{r}}$ , as shown in Fig. 1, the receiver directivity coefficients can be written as

C_{v,u}^{(\text{r})}=-ikj_{v}(kd_{y}^{(\text{r})})Y_{v,u}^{*}(\theta_{y}^{(\text{r})},\phi_{y}^{(\text{r})})~{}.

(22)

By substituting Eq. (22) into Eq. (21), and utilizing the property of spherical harmonics $Y_{v,u}^{*}(\theta_{y}^{(\text{r})},\phi_{y}^{(\text{r})})=(-1)^{u}Y_{v,-u}(\theta_{y}^{(\text{r})},\phi_{y}^{(\text{r})})$ , one can obtain the same expression of the RTFs as the GISM shown in Eq. (11).

III.4 Far-field approximation

\fig

Figure3.eps

Figure 3: An example of reversed reflection paths along the

x-

direction used in the ISM. For an odd number of reflections,

\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{1},\mathbf{0}}

and

\mathbf{R}^{\text{s}\to\text{rI}}_{\mathbf{1},\mathbf{0}}

share the same-length, but reversed reflection path. However, for an even number of reflections, e.g.,

\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{0},\mathbf{-1}}

, the reversed path

\mathbf{R}^{\text{s}\to\text{rI}}_{\mathbf{0},\mathbf{1}}

is symmetric w.r.t the original room cell instead of

\mathbf{R}^{\text{s}\to\text{rI}}_{\mathbf{0},\mathbf{-1}}

The computational cost of the DEISM RTF is expected to increase with frequency since the directivity patterns tend to be more complicated at higher frequencies [61], resulting in higher orders of the directivities, i.e., larger $N$ and $V$ , required in the DEISM. As either $N$ or $V$ increases, the number of summations over $l$ in the mode coupling coefficients in Eq. (12) increases, thus consuming more computational resources. Therefore, it is necessary to investigate the possibility of simplifying the DEISM in Eq. (21) while maintaining sufficient accuracy. To this end, a far-field approximation is applied to each reflection path of the RTF in Eq. (21) in this section.

Without loss of generality, we consider a reflection path from the source image to the receiver, denoted by the vector $\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ , for the following analysis. The mode coupling coefficients from Eq. (20) for a single reflection path can be expressed as follows,

$\displaystyle\tilde{\alpha}_{v,u}^{n,m}(k)=$	$\displaystyle\beta(\mathbf{p},\mathbf{q})\varLambda(\mathbf{p},m,n)\frac{4\pi i}{k}i^{v-n}$	(23)
	$\displaystyle\times(-1)^{m^{\prime}+u}\displaystyle\sum_{l=\|n-v\|}^{n+v}i^{l}h^{(2)}_{l}(k\|\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|\|)$
	$\displaystyle\times Y_{l,m^{\prime}-u}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})W_{1}W_{2}\xi~{},$	(24)

where $m^{\prime}=(-1)^{p_{x}+p_{y}}m$ and $\varLambda(\mathbf{p},m,n)=(-1)^{(p_{y}+p_{z})m+p_{z}n}$ are the additional effect of the mirroring on the spherical harmonics [49]. Assuming $||\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}||$ is large, we employ the large argument of the spherical Hankel function, i.e., $h^{(2)}_{l}(k||\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}||)\approx i^{l+1}e^{-ik||\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}||}/k||\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}||$ . The modified mode coupling coefficients in far field can be expressed as

$\displaystyle\tilde{\alpha}_{v,u}^{n,m}(k)\approx$	$\displaystyle\beta(\mathbf{p},\mathbf{q})\varLambda(\mathbf{p},m,n)\frac{4\pi}{k}i^{v-n}$
	$\displaystyle\times(-1)^{m^{\prime}+u+1}\frac{e^{-ik\|\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|\|}}{k\|\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|\|}\displaystyle\sum_{l=\|n-v\|}^{n+v}(-1)^{l}$
	$\displaystyle\times Y_{l,m^{\prime}-u}(\Omega_{\mathbf{p},\mathbf{q}})W_{1}W_{2}\xi~{}.$	(25)

Note that the product of two spherical harmonics can be expressed using the formula [62],

	$\displaystyle Y_{n,m}(\theta,\phi)Y_{v,-u}(\theta,\phi)=(-1)^{n+m+v-u}$
	$\displaystyle\times\displaystyle\sum_{l=\|n-v\|}^{n+v}(-1)^{l}Y_{l,m-u}(\theta,\phi)W_{1}W_{2}\xi~{}.$		(26)

By applying Eq. (26) to Eq. (25), we can express the mode coupling coefficients as follows,

$\displaystyle\tilde{\alpha}_{v,u}^{n,m}(k)\approx$	$\displaystyle\beta(\mathbf{p},\mathbf{q})\varLambda(\mathbf{p},m,n)\frac{4\pi}{k}i^{v-n}$
	$\displaystyle\times(-1)^{n+v-1}\frac{e^{-ik\|\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|\|}}{k\|\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|\|}$
	$\displaystyle\times Y_{n,m}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})Y_{v,-u}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})~{}.$	(27)

The transfer function of one reflection path using the far-field approximation model can thus be written as,

	$\displaystyle H^{\mathbf{p},\mathbf{q}}_{\text{DEISM-LC}}(\mathbf{x}_{\text{r}}\|\mathbf{x}_{\text{s}},k)\approx-\beta(\mathbf{p},\mathbf{q})\frac{4\pi}{k}\frac{e^{-ik\|\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|\|}}{k\|\|\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}\|\|}$
	$\displaystyle\times\sum_{n=0}^{N}\sum_{m=-n}^{n}\varLambda(\mathbf{p},m,n)i^{-n}(-1)^{n}C^{(\text{s})}_{n,m}(k)Y_{n,m^{\prime}}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})$
	$\displaystyle\times\sum_{v=0}^{V}\sum_{u=-v}^{v}i^{v}(-1)^{v}C^{(\text{r})}_{v,-u}(k)Y_{v,-u}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})~{},$		(28)

where DEISM-LC is used to refer to the Low-Complexity (LC) solution.

The terms $\varLambda(\mathbf{p},m,n)Y_{n,m^{\prime}}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})$ are a result of mirroring of spherical harmonics. Since the source images are used in Eq. (28), the additional factors $\varLambda(\mathbf{p},m,n)$ and $m^{\prime}$ need to be associated with the source spherical harmonics. To further simplify the expressions in Eq. (28), we propose to replace $\varLambda(\mathbf{p},m,n)Y_{n,m^{\prime}}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})$ by a simplified expression, which can be achieved as shown in the following by using the direction that corresponds to the reversed traveling path of the vector $\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ .

First, we note that the vector pointing from the receiver to the source image $\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}$ is the inverse of the vector $\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ given in Eq. (4). In addition, we define the vector pointing from the source to the receiver images $\mathbf{R}^{\text{s}\to\text{rI}}_{\mathbf{p},\mathbf{q}}$ , which can be expressed similarly to Eq. (4) as

\displaystyle\mathbf{R}^{\text{s}\to\text{rI}}_{\mathbf{p},\mathbf{q}}=

\displaystyle\begin{bmatrix}x_{\text{r}}-x_{\text{s}}-2p_{x}x_{\text{r}}+2q_{x}L_{x}\\ y_{\text{r}}-y_{\text{s}}-2p_{y}y_{\text{r}}+2q_{y}L_{y}\\ z_{\text{r}}-z_{\text{s}}-2p_{z}z_{\text{r}}+2q_{z}L_{z}\end{bmatrix}~{}.

(29)

In spherical coordinates, the vectors $\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}$ and $\mathbf{R}^{\text{s}\to\text{rI}}_{\mathbf{p},\mathbf{q}}$ have directions $\Omega^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}$ and $\Omega^{\text{s}\to\text{rI}}_{\mathbf{p},\mathbf{q}}$ , respectively. These two vectors are reversed reflection paths only when there are an odd number of reflections between parallel walls, as shown by the gray arrows in Fig. 3. To take the even number of reflections into account, as shown by the black arrows in Fig. 3, we use the modified vector $\mathbf{R}^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}}$ (with directions $\Omega^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}}$ ) that represents the reversed path of $\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}$ with its subscript indices $\tilde{\mathbf{p}},\tilde{\mathbf{q}}=(\tilde{p}_{x},\tilde{p}_{y},\tilde{p}_{z}),(\tilde{q}_{x},\tilde{q}_{y},\tilde{q}_{z})$ taking the following expressions

\displaystyle\tilde{p}_{a},\tilde{q}_{a}=\begin{cases}p_{a},q_{a}&\text{if }|2q_{a}-p_{a}|\mod 2=1\\ p_{a}^{\prime},q_{a}^{\prime}&\text{if }|2q_{a}-p_{a}|\mod 2=0\end{cases}~{},

(30)

where $a\in\{x,y,z\}$ , and $p_{a}^{\prime},q_{a}^{\prime}$ represent the image indices that are symmetric to $p_{a},q_{a}$ w.r.t the image indices $p_{a},q_{a}=0,0$ . The relation between $p_{a}^{\prime},q_{a}^{\prime}$ and $p_{a},q_{a}$ can be summarized as

	$\displaystyle 2q_{a}-p_{a}$	$\displaystyle=p_{a}^{\prime}-2q_{a}^{\prime}$		(31)
	$\displaystyle q_{a}^{\prime}$	$\displaystyle=\Big{\lfloor}\frac{p_{a}-2q_{a}}{2}\Big{\rfloor}~{},$		(32)

where $\lfloor\cdot\rfloor$ denotes the floored division.

Secondly, due to the properties of the mirroring effect on the spherical harmonics, one can write the following relation between the spherical harmonics associated with the spherical direction $\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ in Eq. (28) and $\Omega^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}}$ introduced above

\varLambda(\mathbf{p},m,n)Y_{n,m^{\prime}}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})=Y_{n,m}(\Omega^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}})~{}.

(33)

Since $\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}=-\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ , i.e., $\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}$ and $\Omega^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}$ have opposite directions, the following relation is obtained

Y_{v,-u}(\Omega^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}})=(-1)^{v}Y_{v,-u}(\Omega^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}})~{}.

(34)

Finally, by substituting Eqs. (33) and (34) in the approximated solution in Eq. (28), we obtain the simplified expression

	$\displaystyle H^{\mathbf{p},\mathbf{q}}_{\text{DEISM-LC}}(\mathbf{x}_{\text{r}}\|\mathbf{x}_{\text{s}},k)\approx-\beta(\mathbf{p},\mathbf{q})\frac{4\pi}{k}\frac{e^{-ik\|\|\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}\|\|}}{k\|\|\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}\|\|}$
	$\displaystyle\times\sum_{n=0}^{N}\sum_{m=-n}^{n}i^{n}C^{(\text{s})}_{n,m}(k)Y_{n,m}(\Omega^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}})$
	$\displaystyle\times\sum_{v=0}^{V}\sum_{u=-v}^{v}i^{v}C^{(\text{r})}_{v,u}(k)Y_{v,u}(\Omega^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}})~{}.$		(35)

The RTF of the DEISM-LC is then given by

	$\displaystyle H_{\text{DEISM-LC}}(\mathbf{x}_{\text{r}}\|\mathbf{x}_{\text{s}},k)=\sum_{\mathbf{p}\in\mathcal{P}}\sum_{\mathbf{q}\in\mathcal{Q}}-\beta(\mathbf{p},\mathbf{q})\frac{4\pi}{k}\frac{e^{-ik\|\|\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}\|\|}}{k\|\|\mathbf{R}^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}}\|\|}$
	$\displaystyle\times\sum_{n=0}^{N}\sum_{m=-n}^{n}i^{n}C^{(\text{s})}_{n,m}(k)Y_{n,m}(\Omega^{\text{s}\to\text{rI}}_{\tilde{\mathbf{p}},\tilde{\mathbf{q}}})$
	$\displaystyle\times\sum_{v=0}^{V}\sum_{u=-v}^{v}i^{v}C^{(\text{r})}_{v,u}(k)Y_{v,u}(\Omega^{\text{r}\to\text{sI}}_{\mathbf{p},\mathbf{q}})~{}.$		(36)

The DEISM-LC is more computationally efficient since it does not require the summation over $l$ in the mode coupling coefficient in Eq. (12). The DEISM in Eq. (21) has a time complexity of $\mathcal{O}(IN^{2}V^{2}(N+V))$ per wavenumber $k$ , where $I$ is the number of images, $N$ and $V$ are the maximum SH order used to describe directivities of the source and receiver. However, the time complexity of DEISM-LC is reduced to $\mathcal{O}(IN^{2}V^{2})$ per wavenumber $k$ , which can lead to approximately $N+V$ times faster computations compared to DEISM. As the frequency increases, larger $N$ or $V$ is usually necessary to accurately describe the directivity. Therefore, one might find DEISM-LC more practical for simulating high frequencies.

While the simplified method described here is similar to that of the FSRR method, we note that there are significant differences between Eq. (36) and Eq. (15). As stated in Sec. III.1, the FSRR utilizes directivity coefficients $\tilde{C}^{(\text{s})}_{n,m}(k),\tilde{C}^{(\text{r})}_{v,u}(k)$ that are obtained at $1$ m from the source or the receiver. However, the directivity coefficients used in DEISM-LC are not limited to this assumption as long as the transparent spheres enclose the devices. Furthermore, there are inherent dissimilarities between DEISM-LC and FSRR due to the different coefficients, e.g., $B_{\mathbf{p},\mathbf{q}},M_{\mathbf{p},\mathbf{q}}$ used in Eq. (15) and $i^{n},i^{v}$ used in Eq. (36). This difference is also illustrated by an example described in Sec. IV.2.

III.5 Algorithm Summary

A summary of the algorithm proposed in Sec. III.3 is listed as pseudocode in 1, 2 and 3. A Python implementation of the algorithm, along with an example, is available online at https://github.com/audiolabs/DEISM. The pseudocode of the far-field approximation method is not provided here for brevity.

Algorithm 1 Calculate lookup tables

1:Calculate the Wigner-3j symbols

W_{1}(n,v,l)

and

W_{2}(n,v,l,m,u)

from Eq. (14).

2:Calculate the spherical harmonic coefficients

P_{n,m}(k,r_{0})

shown in Eq. (16) for both the source and receiver using the least squares approach.

3:Calculate the directivity coefficients of the source and receiver, viz.

C^{(\text{s})}_{n,m}(k)

and

C^{(\text{r})}_{n,m}(k)

, using the relation

P_{n,m}(r_{0},k)=C_{n,m}(k)h_{n}(kr_{0})

. Then store the directivity coefficients as matrices

C^{(s)}(n,m,k)

and

C^{(r)}(n,m,k)

The calculation of the Wigner 3j symbols can be computationally expensive if they are computed for each reflection path. Since they are identical for each reflection path, to reduce the computational effort, we precalculate them for all the given spherical harmonic orders and modes $n,m,v,u,l$ , and store them in a lookup table. In addition, the directivity coefficients $C^{(\text{s})}_{n,m}(k)$ and $C^{(\text{r})}_{n,m}(k)$ are precalculated and stored in a lookup table. The procedures for generating the lookup tables are listed in Alg. 1.

In the provided implementation, we separate the whole procedure of the ISM into two parts, viz. the image generation given in Alg. 2 and the single-path transfer function calculation using DEISM given in Alg. 3. The computation in the spherical harmonic domain is generally also expensive, and therefore we use parallel computation for all reflection paths after all images are calculated.

Algorithm 2 Image generation

\mathbf{p}=(p_{x},p_{y},p_{z})\in\mathcal{P}=\{0,1\}^{3}

{A triple introduced in Tab. 1}.

\mathbf{q}=(q_{x},q_{y},q_{z})\in\mathcal{Q}=\{-N_{m},...,0,...,N_{m}\}^{3}

{A triple introduced in Tab. 1}.

\mathcal{A}=\mathcal{P}\times\mathcal{Q}

{A set containing all possible images.}

4:for

(\mathbf{p},\mathbf{q})\in\mathcal{A}

5: if

\|2q_{x}-p_{x}\|+\|2q_{y}-p_{y}\|+\|2q_{z}-p_{z}\|\leq N_{o}

then {

N_{o}=

maximum reflection order.}

6: Calculate the vector pointing from the source image to the receiver

\mathbf{R}^{\text{sI}\to\text{r}}_{\mathbf{p},\mathbf{q}}

in Eq. (4) and its spherical coordinates

(d_{x_{0}},\theta_{x_{0}},\phi_{x_{0}})

7: Calculate the incident angles

\theta_{a}(\mathbf{p},\mathbf{q})

w.r.t different walls using Eq. (5).

8: Calculate the reflection coefficients

\beta_{x_{1}},\beta_{x_{2}},\beta_{y_{1}},\beta_{y_{2}},\beta_{z_{1}},\beta_{z_{2}}

using Eq. (6).

9: Calculate the wall attenuation

\beta(\mathbf{p},\mathbf{q})=\beta_{x1}^{|q_{x}-p_{x}|}\beta_{x2}^{|q_{x}|}\beta_{y1}^{|q_{y}-p_{y}|}\beta_{y2}^{|q_{y}|}\beta_{z1}^{|q_{z}-p_{z}|}\beta_{z2}^{|q_{z}|}

10: else

11:

\mathcal{A}=\mathcal{A}\setminus\{(\mathbf{p},\mathbf{q})\}

{Remove current image if it exceeds

N_{o}

12: end if

13:end for

Algorithm 3 DEISM core

P(k)=0

{Initialize the sound field at

\mathbf{x}_{\text{r}}

2:for

(\mathbf{p},\mathbf{q})\in\mathcal{A}

3: for

n=0

N

4: for

m=-n

n

\Upsilon=(-1)^{(p_{y}+p_{z})m+p_{z}n}

{Part of mirror effect of spherical harmonics in Eq. (12).}

m^{\prime}=(-1)^{p_{x}+p_{y}}m

{The modified mode

m

\alpha_{v,u}^{n,m}

in Eq. (12).}

7: for

v=0

V

8: for

u=-v

v

\alpha^{n,m^{\prime}}_{v,u}=0

{Initialize the mode coupling coefficient in Eq. (13).}

10: for

l=|n-v|

n+v

11:

\xi=\sqrt{(2n+1)(2v+1)(2l+1)/4\pi}

12:

\alpha^{n,m^{\prime}}_{v,u}\mathrel{+}=4\pi i^{v-n}(-1)^{m^{\prime}}i^{l}h^{(2)}_{l}(kd_{x_{0}})Y_{l,m^{\prime}-u}(\theta_{x_{0}},\phi_{x_{0}})W_{1}(n,v,l)

13:

\times W_{2}(n,v,l,m^{\prime},u)\xi

{Update the mode coupling coefficient.}

14: end for

15:

P(k)\mathrel{+}=\Upsilon\beta(\mathbf{p},\mathbf{q})C^{(s)}(n,m,k)\alpha^{n,m^{\prime}}_{v,u}\frac{i(-1)^{u}}{k}C^{(r)}(v,u,k)

{Summation over

\mathcal{A}

constitutes the room transfer function in Eq. (21).}

16: end for

17: end for

18: end for

19: end for

20:end for

IV Evaluation of the proposed methods

In the following sections, we describe the experimental setups used to evaluate the performance of DEISM and DEISM-LC (see Eq. (21) and Eq. (36), respectively). The simulated RTFs are plotted and compared for different scenarios. The differences between the DEISM and the FEM solutions are evaluated using error metrics, and the difference between DEISM and DEISM-LC is analyzed. In addition, the effects of incorporating directivities into the RTFs by comparing the DEISM-LC with the original ISM, and the differences between DEISM-LC and FSRR are also demonstrated by comparing them with FEM.

IV.1 Test setup

The proposed methods require a description of the directivities of the devices, i.e., a loudspeaker (or microphone) in the spherical harmonic domain. These data can be obtained from measurements of the pressure field in the local vicinity of the transducer and its enclosure. Alternatively, directivity patterns obtained from simulations could be used. In this work, the FEM is used to simulate the free-field pressure fields surrounding the transducer and its enclosure. Aretz et al. [37] have demonstrated that solutions of the ISM with angle-dependent reflection coefficients are in good agreement with FEM solutions above the Schroeder frequency. Therefore, the FEM is used in this work to evaluate the performance of the proposed method.

This section presents the details of generating the simulated transducer directivities and the RTFs.

IV.1.1 Transducer directivities

\figline

\fig

Figure4a.png0.15 \figFigure4b.png0.1 \figFigure4c.png0.1

Figure 4: Speaker shapes considered in this work. From left to right, these speakers are referred to in the text as: cuboidal (cub.), spherical (sph.) and cylindrical (cyl.), where the terms in brackets are abbreviations also used in this work. The three speakers share the same height and the same-sized vibrating piston.

The three speaker shapes considered in this work are shown in Fig. 4. For each speaker, the loudspeaker driver is modeled as a vibrating piston with frequency-independent unit acceleration. For each microphone, a monopole source with frequency-independent unit acceleration is placed at the microphone position, which is located on the top and close to the front of each enclosure.

One must ensure that free-field conditions are used to capture the loudspeaker and microphone directivities. In a measurement setup, one might use an anechoic chamber. When using the FEM, a suitable non-reflecting boundary condition is required. In this work, a perfectly matched layer [63] is used to reduce unwanted reflections from the edge of the computational domain. Note that care must be taken to ensure that the perfectly matched layer sufficiently reduces the unwanted reflections - this is achieved in this work by performing a priori mesh studies. Each speaker is meshed using triangular elements, the surrounding air (with a sound speed of 343 m/s and density of 1.2 kg/m³) is meshed using tetrahedral elements, and the perfectly matched layer is meshed using hexahedral elements. An average element size that ensures that there are ten degrees of freedom (or five quadratic elements) per wavelength is used. Quadratic shape functions are used. The free-field transfer function is simulated from 20 Hz to 1 kHz with steps of 2 Hz.

For each speaker, two sizes are considered, which are referred to as Size 1 and Size 2, where Size 2 is the smallest. The sound pressure field is simulated on a transparent sphere around the speaker for each size and shape. Note that, for a given speaker shape, the radius of the transparent sphere changes depending on the size of the speaker and the position of the transducer on the speaker; The reason being that we place the transducer at the center of the coordinate system. For the source on the speakers of Size 1, the radius of the transparent sphere is 0.4 m, while for Size 2, it is 0.2 m. For the receivers on the speakers, a radius of 0.5 m is used for Size 1, and a radius of 0.25 m is used for Size 2.

The complex-valued pressure field simulated on the surface of the transparent sphere is used to compute the spherical harmonic directivity coefficients. For the loudspeaker directivity, this involves a straightforward implementation of spherical harmonic decomposition [55]. To obtain the receiver directivities, we use acoustic reciprocity [51] (see Sec. III.2).

\figline

\fig

Figure5a.eps0.175Sound pressure level (dB) \figFigure5b.eps0.175Phase in radians ( $\pi$ )

Figure 5: Plots of sound pressure level and phase reconstructed using simulated speaker directivity coefficients of a large (Size 1) spherical and cuboidal speaker. The pressure field is reconstructed with maximum spherical harmonic order

5

. The results are evaluated on the XY-plane at three frequencies.

An example of the directivity patterns is shown in Fig. 5, where the Sound Pressure Level (SPL) and phase are plotted for the spherical and cuboidal speakers with Size 1. The vibrating piston faces the positive direction of the $x$ -axis, which is hereafter referred to as $+x$ . The pressure field on the XY-plane is reconstructed using Eq. (16) using simulated directivity coefficients $C^{(\text{s})}_{n,m}$ with maximum spherical harmonic order $N=5$ . It can be seen that both the SPL and phase plots show differences between the two selected speaker shapes, especially at angles that are in the shadow region of the speakers. One can also observe a significant phase jump in the shadow region when a cuboidal speaker is used, which is less prominent for a spherical speaker.

IV.1.2 Room transfer functions

A rectangular room with dimensions $L_{x}\times L_{y}\times L_{z}=4~{}\text{m}\times 3~{}\text{m}\times 2.5~{}\text{m}$ is considered. The air in the room has a speed of sound of 343 m/s and a density of 1.2 kg/m³. A uniform frequency-independent normalized impedance of 18, which gives an absorption coefficient of approximately 0.2, is specified at the walls of the room. This value is chosen because it gives a reverberation time of approximately 0.29 s at the fundamental frequency of 43.88 Hz, which gives a Schroeder frequency of approximately 198 Hz. The reverberation time was estimated using the modal model described by Morse and Bolt [64]. Note that, in general, the surface impedance is a complex-valued frequency-dependent quantity. Our choice of a real-valued frequency-independent impedance does not limit the generality of the method described here.

Five source-receiver configurations are simulated. The center position of each transducer, for each configuration, is given in Table 2. Four of the configurations contain two devices, one with a loudspeaker and the other with a microphone. The remaining configuration 3 has a single device on which the loudspeaker and microphone are placed. The transfer functions of the direct path between the source and receiver are simulated using the FEM for Configuration 3. The microphone position $\mathbf{x}^{\text{C3}}_{\text{r}}$ in Table 2 depends on the size, shape, and rotation of the speaker and takes the values given in Table 3. A 2D illustration of the speaker position and facing configurations is shown in Fig. 6. The source and receiver are both at least $1$ m away from the walls in Positions $1-3$ . Positions $4-5$ allow a closer distance between the walls and at least one speaker to examine the performance of the DEISM in challenging configurations.

Table 2: Configurations of the transducers and the corresponding speaker orientations used in the room simulation. The position vector

\mathbf{x}^{\text{C3}}_{\text{r}}

for Configuration 3 is listed for different shapes and sizes in Tab. 3.

Configuration id	source, orientation	receiver, orientation
1	[1.1, 1.1, 1.3], $+x$	[2.9, 1.9, 1.3], $-x$
2	[1.1, 1.1, 1.3], $+x$	[1.9, 1.6, 1.4], $-x$
3	[1.1, 1.1, 1.3], $+x$	$\mathbf{x}^{\text{C3}}_{\text{r}}$ , $+x$
4	[0.4, 1.1, 1.3], $+x$	[2.1, 1.6, 1.3], $-x$
5	[0.4, 1.1, 1.3], $+x$	[2.5, 2.6, 1.3], $-y$

Table 3: Positions

\mathbf{x}^{\text{C3}}_{\text{r}}

for simulations with the source and receiver on the same speaker (Configuration 3).

Speaker shape	Size 1 (large)	Size 2 (small)
Cuboid	[1.05, 1.1, 1.5]	[1.075, 1.1, 1.4]
Spherical	$[1.05,1.1,1.473]$	$[1.075,1.1,1.387]$
Cylindrical	[1.05, 1.1, 1.5]	[1.075, 1.1, 1.4]

\fig

Figure6.eps0.4

Figure 6: A 2D view of the positions of the mounted transducers inside the room. The black dots and gray dots represent the approximate positions of the source and receiver transducers, respectively. The arrows of the dots indicate the directions in which the speakers face, and the numbers indicate the source and receiver positions.

Each model is meshed using tetrahedral elements with an average size that ensures that 10 degrees of freedom per wavelength are used. Quadratic shape functions are used. The room transfer functions for each configuration are simulated, from 20 Hz to 1 kHz, in steps of 2 Hz.

IV.2 Preliminary comparisons with ISM and FSRR

In this section, we compare the DEISM-LC to existing RTF simulation methods. To this end, we choose the original ISM with omnidirectional transducers (referred to as ISM-Omni) and the FSRR method, as introduced in Eq.(1) and Eq.(15), respectively. We demonstrate the effects of incorporating transducer directivities into the RTFs by comparing the ISM-Omni with the DEISM-LC. We are also interested in the differences between the FSRR method and DEISM-LC as they are computationally more efficient compared to the DEISM. In Fig. 7, the magnitudes and phases of the RTFs using the ISM-Omni, DEISM-LC, FSRR, and FEM are plotted. As an example, the results are shown for large cuboidal speakers in Configuration 2 with maximum reflection order 25 and maximum spherical harmonic order 5 for both the source and receiver. The magnitudes of the RTFs are given in terms of SPL in decibels and the unwrapped phases in radians. Unless specified, the magnitude and phase responses use the same units in the subsequent sections. The frequency-dependent SPL is defined as [53]

\text{SPL}(k)=20\log_{10}\bigg{(}\frac{P_{\text{rms}}(k)}{P_{0}}\bigg{)}~{},

(37)

where $P_{\text{rms}}$ is the root-mean-square pressure calculated from the RTFs and $P_{0}=2\times 10^{-5}~{}\text{Pa}$ is the reference pressure. The FEM solutions are presumed as the ground truth of the RTFs in this section and throughout the following evaluations.

\figcolumn

\fig

Figure7a.eps0.5 \figFigure7b.eps0.5

Figure 7: A comparison of the magnitudes and phase responses between the ISM-Omni, DEISM-LC, FSRR, and FEM solutions. The results are shown for large cuboidal speakers with Configuration 2. Max. reflection order is 25 and max. SH orders are 5 for both the source and receiver.

The RTFs of the ISM-Omni were generated using an omnidirectional source and receiver at the same locations as those used in Configuration 2. The directivity coefficients of the source and receiver are the same and contain only a monopole component based on Eq. (22), i.e.,

C^{(\text{s})}_{0,0}(k)=C^{(\text{r})}_{0,0}(k)=-ikj_{0}(0)Y_{0,0}^{*}(\theta,\phi)=\frac{-ik}{\sqrt{4\pi}}~{},

(38)

where $(\theta,\phi)$ denotes the directions in spherical coordinates of either the point source or the omnidirectional receiver relative to their local origins. For the RTFs of the FSRR given in Eq.(15), the directivity coefficients $\tilde{C}^{(\text{s})}_{n,m}(k),\tilde{C}^{(\text{r})}_{n,m}(k)$ are extrapolated from $C^{(\text{s})}_{n,m}(k),C^{(\text{r})}_{v,u}(k)$ to the measurement sphere with a radius of $1$ m, and we choose $M_{\mathbf{p},\mathbf{q}}=\beta(\mathbf{p},\mathbf{q})$ .

Overall, the differences between the DEISM-LC and FEM are small compared to those between the ISM-Omni and FEM and those between the FSRR and FEM. This indicates that the DEISM-LC can generate more realistic RTFs when compared to the other image source methods for the considered scenario. The large offset in magnitude responses between the DEISM-LC and ISM-Omni mainly come from the different source types, i.e., a point source used in ISM-Omni and a vibrating piston used in DEISM-LC. However, the locations and Q-factors of the resonant peaks and the phase responses also vary greatly. This shows that the directivities of the source and receiver, including sound diffraction around the speakers, can affect the RTFs to a large extent.

The differences between the DEISM-LC and FSRR are mainly attributed to the inherent mismatch in their RTF parameterizations, i.e., between Eq.(36) and Eq.(15). The magnitude responses are different not only in the amplitude but also in the locations and Q-factors of the resonant peaks. The phase response of the FSRR even contains positive values at lower frequencies corresponding to non-causal behaviors, which results from the random sign $B_{\mathbf{p},\mathbf{q}}$ used in Eq. (15).

Since the inherent differences between DEISM-LC and other methods, i.e., the ISM-Omni and FSRR, are shown to be large, we limit the following evaluations to DEISM, DEISM-LC, and FEM.

IV.3 Varying distance between source and receiver

\figcolumn

\fig

Figure8a.eps0.5\figFigure8b.eps0.5

Figure 8: Magnitudes and phase responses for small spherical speakers with varying distance between speakers. Max. reflection order is 25 and max. SH orders are 5 for both the source and receiver. The magnitudes and phases are shifted by vertical arrows for better distinction of the curves.

In this section, we present a comparison between the FEM and DEISM solutions for the small spherical speaker when the distance between the source and receiver is changing, corresponding to Configuration 1-3 in Tab. 2. The two distances in Configuration 1-2 are around $1.97$ m, $0.95$ m and the distances in Configuration 3 are around $0.1$ m and $0.2$ m for the small and large speakers, respectively. The magnitude and phase are compared in Fig. 8. To avoid visual overlap of the curves for different configurations, the SPLs and phases are plotted by shifting the curves vertically, as denoted by the arrows.

We see that good agreement is obtained between the FEM and DEISM solutions. In general, greater differences are found at the notches of the RTFs, than at the peaks. Additionally, the differences between DEISM and DEISM-LC solutions are much smaller than those between the FEM and DEISM solutions in the notches of the RTFs. Since Configuration 3 contains only one speaker in the room, the corresponding RTFs show fewer variations across frequencies compared to the other two configurations. Similar agreement between FEM and DEISM was found for the other speaker shapes. For brevity, the respective plots are not presented here.

IV.4 Source and receiver on the same device

Next, we compare solutions for cases where the source and receiver are placed on the same device (Configuration 3). For this comparison, we analyze data for the three speaker shapes considered.

The respective magnitude and phase solutions are shown in Fig. 9. Once again, good agreement is obtained between the FEM and DEISM solutions. The magnitudes and phases of the RTFs of the cuboidal and cylindrical speakers show very similar behaviour. In contrast, those of the spherical speaker exhibit clear differences when compared to the RTFs of the other speakers. Note that the main differences appear at the notches of the RTF magnitudes, which correspond to frequencies around $444$ Hz and $650$ Hz. In addition, the phase curves of the cuboidal and cylindrical speakers also exhibit significant jumps at some frequencies, which can not be observed for the spherical speaker. The differences between the curves indicate that the edges of the cuboidal and cylindrical speakers can induce additional diffraction and lead to distinct characteristics in the RTFs.

\figcolumn

\fig

Figure9a.eps0.5\figFigure9b.eps0.5

Figure 9: Magnitudes and phase responses for large speakers with Configuration 3 while varying shapes of the speakers. Max. reflection order is 25 and max. SH. orders are 5 for both the source and receiver. The magnitudes and phases are shifted by verticle arrows for better distinction of the curves.

IV.5 Quantitative analysis

In this section, we quantify the differences between the FEM and DEISM/DEISM-LC solutions and DEISM and DEISM-LC solutions. The following error metrics are used for the comparison: The root-mean-square log spectral distance [65]

e_{\text{lsd}}=\sqrt{\frac{1}{K}\displaystyle\sum_{k\in\mathcal{K}}\Bigg{|}10\log_{10}\bigg{(}\frac{|H_{\text{Test}}(k)|}{|H_{\text{FEM}}(k)|}\bigg{)}^{2}\Bigg{|}^{2}}~{},

(39)

and the normalized root-mean-square phase error in radians [47]

e_{\text{p}}=\sqrt{\frac{1}{K}\displaystyle\sum_{k\in\mathcal{K}}\big{(}\angle H_{\text{Test}}(k)-\angle H_{\text{FEM}}(k)\big{)}^{2}}~{},

(40)

where $\mathcal{K}$ is the set of the $K$ wavenumbers used in simulations, $H_{\text{Test}}(k)$ is the simulated RTF of either the DEISM or DEISM-LC at wavenumber $k$ , and $H_{\text{FEM}}(k)$ is the simulated RTF of the FEM at wavenumber $k$ . The log spectral distance and the phase errors are presented for different positions, speaker shapes, and sizes in Tables 4 and 5, respectively. In both tables, lower values indicate better performance. General observations can be made for Positions 1-3 in the two tables:

•

The smaller the speaker is, the smaller the difference between the DEISM and FEM solutions.
•

The fewer speakers there are, i.e., for Configuration 3, the smaller the difference.

Table 4: Root-mean-square log spectral distance in decibel between DEISM and FEM, and between DEISM-LC and FEM. (s. denotes small.)

config. id	cub.	s. cub.	sph.	s. sph.	cyl.	s. cyl.
1 DEISM	1.918	1.214	2.268	1.143	2.062	0.950
1 DEISM-LC	1.669	1.070	2.206	1.125	2.007	0.916
2 DEISM	1.974	1.051	2.521	1.003	1.900	0.964
2 DEISM-LC	1.855	1.003	2.060	0.984	1.809	0.943
3 DEISM	0.423	0.204	0.841	0.170	0.920	0.154
3 DEISM-LC	0.427	0.206	0.883	0.174	0.952	0.155
4 DEISM	7.018	8.187	6.457	7.418	6.350	7.287
4 DEISM-LC	6.864	8.149	6.461	7.298	6.240	7.211
5 DEISM	6.887	6.810	7.077	6.309	7.011	6.425
5 DEISM-LC	6.792	6.807	6.986	6.277	6.968	6.397

Table 5: Normalized root-mean-square phase error in radians between DEISM and FEM, and between DEISM-LC and FEM. (s. denotes small.)

config. id	cub.	s. cub.	sph.	s. sph.	cyl.	s. cyl.
1 DEISM	0.258	0.179	0.307	0.174	0.278	0.162
1 DEISM-LC	0.265	0.153	0.300	0.176	0.267	0.181
2 DEISM	0.251	0.138	0.246	0.128	0.255	0.123
2 DEISM-LC	0.246	0.149	0.259	0.142	0.260	0.144
3 DEISM	0.051	0.026	0.104	0.032	0.093	0.024
3 DEISM-LC	0.052	0.026	0.109	0.032	0.097	0.024
4 DEISM	1.796	1.764	1.834	1.776	1.801	1.778
4 DEISM-LC	1.791	1.763	1.828	1.774	1.781	1.776
5 DEISM	1.795	1.926	1.818	1.906	1.822	1.921
5 DEISM-LC	1.791	1.928	1.816	1.907	1.821	1.923

The errors of the DEISM solutions increase when the speakers are located close to one wall of the room, i.e., for Configurations 4 and 5. The large errors may be attributed to the mirror source approximation since the speakers can have more complicated wave interaction with the nearby walls, which is not considered in this model, and can lead to additional RTF fluctuations. For devices with a physical extent, the assumption of a source image might not hold and could require more complex modeling.

In Configuration 3, the DEISM outperforms DEISM-LC for all speaker shapes and sizes. However, this does not necessarily hold for Configurations 1 and 2. This is mainly caused by the notches of the RTF magnitudes, where the DEISM produces much lower values than DEISM-LC, as shown in Fig. 8 and Fig. 9.

To provide insight into the performance of the methods in larger rooms, we consider the case in which the room’s length in the $x$ -dimension is doubled. For this test, only the cylindrical driver is considered in Configurations 1-3. The error levels obtained are presented in Table 6. The error levels for the larger room are similar to the values given in Tables 4 and 5, and imply similar conclusions.

Table 6: Root-mean-square log spectral distance in decibel and normalized root-mean-square phase error in radians between DEISM and FEM, and between DEISM-LC and FEM in a larger room. (s. denotes small.)

	RMS LSD (decibel)		NRMS phase error (radians)
config. id	cyl.	s. cyl.	cyl.	s. cyl.
1 DEISM	1.177	0.967	0.185	0.128
1 DEISM-LC	1.190	1.036	0.227	0.126
2 DEISM	0.987	0.928	0.123	0.131
2 DEISM-LC	1.188	0.972	0.153	0.119
3 DEISM	0.783	0.319	0.151	0.038
3 DEISM-LC	0.893	0.320	0.155	0.038

IV.6 Evaluation of DEISM-LC

\figcolumn

\fig

Figure10.eps0.5

Figure 10: Relative error (Eq. (41)) between the DEISM and DEISM-LC solutions evaluated for the direct path only while changing the distance between the source and receiver. Both the distance and relative errors are plotted on a logarithmic scale.

\figcolumn

\fig

Figure11.eps0.5

Figure 11: Relative error (Eq. (41)) between the DEISM and DEISM-LC solutions as a function of reflection order for the cuboidal speakers at Positions 1, 2, and 3.

In the previous section, we showed that in some cases, DEISM-LC achieves a smaller difference to the FEM than DEISM, especially at the notches of the RTFs. It is worth noting that the frequencies of the notches might not be accurate since the maximum resolution of the frequencies is $2$ Hz. Therefore, directly comparing the DEISMs with the FEM might not be enough to determine whether the simplified DEISM performs worse or better. In the following, we investigate the difference between the DEISMs by computing the relative error using the $\ell^{2}$ norm, defined as

e_{\ell^{2}}=\frac{\|\mathbf{H}_{\text{DEISM}}-\mathbf{H}_{\text{DEISM-LC}}\|_{2}}{\|\mathbf{H}_{\text{DEISM}}\|_{2}}~{},

(41)

where $\mathbf{H}_{(\cdot)}$ is the complex-valued vector with the transfer function values $H_{(\cdot)}(k)$ of all wavenumbers. In this comparison, we consider the error in free-field and reverberant conditions.

The errors in free-field conditions are given in Fig. 10, where the source-receiver distance and the error are plotted on logarithmic scales. One can observe that the difference decreases to around $1\%$ at a distance of $10$ m and to around $0.1\%$ at a distance of $100$ m. It can be concluded that DEISM-LC asymptotically converges to the DEISM solution as the distance between the source and receiver increases, which is expected from the derivations in Sec. III.4.

In Fig. 11, the relative error is evaluated by changing the maximum reflection order in the DEISM for the cuboidal speakers. Note that for the curves of Configuration 3, the data of reflection order zero is missing since the direct path is simulated using the FEM. The difference between the DEISM solutions reduces for the first few reflection orders, but remains constant for higher reflection orders. Similar behaviors were also found for other speaker shapes, which are not presented here for brevity.

IV.7 Summary

The presented comparisons and error analyses show that the proposed method and its simplified version can simulate RTFs that are in good agreement with FEM solutions in Configurations 1-3. Less agreement is found when a speaker is placed close to the walls, i.e., at Configurations 4 and 5. This might be attributed to the comb filtering caused by the direct path and reflection between the wall and speaker when the distance is less than $0.5$ m [66].

The errors between the DEISM and FEM solutions decrease when smaller speakers are used in the simulation. Additionally, the lowest error is obtained when the source and receiver are placed on the same speaker, which reduces the number of speakers in the simulation, and hence the number of reflecting surfaces. These observations support the hypothesis that the errors come mostly from the omission of speakers with physical extent in the ISM model, i.e., scattering effects between the walls and the speaker’s surfaces are not included.

We have also shown that the simplified version (DEISM-LC) converges to its complete version (DEISM) in the free field and yields a small bias when used to simulate RTFs. The simplified version, DEISM-LC, yields a better agreement with the FEM simulations as opposed to the existing FSRR method.

V Conclusion

We have developed a room transfer function simulation method using spherical harmonic directivity coefficients based on the image source method. The directivity coefficients contain information about the sound diffraction around transducers mounted on one or two devices. The proposed method (DEISM) and its simplified version (DEISM-LC) were compared with FEM solutions and evaluated for different acoustic scenarios. The results show that the proposed approaches can simulate room transfer functions with good agreement to FEM simulations when they are not placed close to the room boundaries. A possible rule of thumb to indicate the region inside the room where the DEISM provides results different from FEM is beyond the scope of this paper, and is left as a topic for future studies. The shapes and sizes of the speakers can also introduce different acoustic diffraction effects, resulting in additional differences in errors of the room transfer functions. The simplified approach can significantly reduce computational complexity, especially when the frequency increases.

Future work could include extending the proposed method to incorporate the sound reflections between the device enclosures or between the enclosures and room boundaries. Such extensions might reduce the error incurred when devices are placed close to the room boundaries or when multiple devices are used. Another valuable extension of the proposed approach would be to apply the DEISM and DEISM-LC to rooms with more general geometries, with the aim of efficiently generating more realistic room acoustics simulations.

References

[1] M. R. Schroeder, “Novel uses of digital computers in room acoustics,” J. Acoust. Soc. Am. 33(11), 1669–1669 (1961) https://doi.org/10.1121/1.1936681 \dodoi10.1121/1.1936681.
[2] M. Kleiner, B.-I. Dalenbäck, and P. Svensson, “Auralization-an overview,” J. Audio Eng. Soc. 41(11), 861–875 (1993).
[3] L. Savioja, “Modeling techniques for virtual acoustics,” Ph.D. thesis, Helsinki University of Technology (Espoo, Finland), 1999.
[4] D. Schröder, Physically based real-time auralization of interactive virtual environments (PhD thesis, RWTH Aachen, 2011).
[5] M. Vorländer, “Computer simulations in room acoustics: Concepts and uncertainties,” J. Acoust. Soc. Am. 133(3), 1203–1213 (2013).
[6] C. Schissler and D. Manocha, “Interactive sound propagation and rendering for large multi-source scenes,” ACM Transactions on Graphics (TOG) 36(4), 1 (2016).
[7] M. Vorländer, Auralization : fundamentals of acoustics, modelling, simulation, algorithms and acoustic virtual reality; 2. ed. (Springer, 2020).
[8] Z. Tang, H.-Y. Meng, and D. Manocha, “Learning acoustic scattering fields for dynamic interactive sound propagation,” in 2021 IEEE Virtual Reality and 3D User Interfaces (VR), IEEE (2021), pp. 835–844.
[9] E. Mabande, K. Kowalczyk, H. Sun, and W. Kellermann, “Room geometry inference based on spherical microphone array eigenbeam processing,” J. Acoust. Soc. Am. 134(4), 2773–2789 (2013) https://doi.org/10.1121/1.4820895 \dodoi10.1121/1.4820895.
[10] D. Cherkassky and S. Gannot, “Blind synchronization in wireless acoustic sensor networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(3), 651–661 (2017) \dodoi10.1109/TASLP.2017.2655259.
[11] C. Evers and P. A. Naylor, “Acoustic slam,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 26(9), 1484–1498 (2018) \dodoi10.1109/TASLP.2018.2828321.
[12] S. Das and T. Bäckström, “Enhancement by postfiltering for speech and audio coding in ad hoc sensor networks,” JASA Express Letters 1(1), 015206 (2021) https://doi.org/10.1121/10.0003208 \dodoi10.1121/10.0003208.
[13] A. Herzog and E. A. P. Habets, “Distance estimation in the spherical harmonic domain using the spherical wave model,” Applied Acoustics 193, 108733 (2022) https://www.sciencedirect.com/science/article/pii/S0003682X22001074 \dodoihttps://doi.org/10.1016/j.apacoust.2022.108733.
[14] D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “Robust sound source tracking using srp-phat and 3d convolutional neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 300–311 (2020).
[15] F. B. Gelderblom, Y. Liu, J. Kvam, and T. A. Myrvoll, “Synthetic data for dnn-based doa estimation of indoor speech,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021), pp. 4390–4394, \dodoi10.1109/ICASSP39728.2021.9414415.
[16] F. Hübner, W. Mack, and E. A. P. Habets, “Efficient training data generation for phase-based doa estimation,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2021).
[17] J. Wechsler, W. Mack, and E. A. P. Habets, “End-to-end signal-aware direction-of-arrival estimation using weighted steered-response power,” in 2022 30th European Signal Processing Conference (EUSIPCO), IEEE (2022), pp. 41–45.
[18] P. Srivastava, A. Deleforge, and E. Vincent, “Realistic sources, receivers and walls improve the generalisability of virtually-supervised blind acoustic parameter estimators,” in 2022 International Workshop on Acoustic Signal Enhancement (IWAENC) (2022), pp. 1–5, \dodoi10.1109/IWAENC53105.2022.9914740.
[19] P. Srivastava, A. Deleforge, A. Politis, and E. Vincent, “How to (virtually) train your speaker localizer” arXiv, July 25, 2023, preprint, v2.
[20] A. Herzog, S. R. Chetupalli, and E. A. P. Habets, “Ambisep: Ambisonic-to-ambisonic reverberant speech separation using transformer networks,” in 2022 International Workshop on Acoustic Signal Enhancement (IWAENC), IEEE (2022), pp. 1–5.
[21] P.-A. Grumiaux, S. Kitić, L. Girin, and A. Guérin, “A survey of sound source localization with deep learning methods,” J. Acoust. Soc. Am. 152(1), 107–151 (2022).
[22] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Am. 65(4), 943–950 (1979).
[23] P. M. Peterson, “Simulating the response of multiple microphones to a single acoustic source in a reverberant room,” J. Acoust. Soc. Am. 80(5), 1527–1529 (1986) https://doi.org/10.1121/1.394357 \dodoi10.1121/1.394357.
[24] J. C. B. Torres, L. Aspoeck, and M. Vorländer, “Comparative study of two geometrical acoustic simulation models,” Journal of the Brazilian Society of Mechanical Sciences and Engineering 40(6), 1–15 (2018).
[25] Z. Tang, L. Chen, B. Wu, D. Yu, and D. Manocha, “Improving reverberant speech training using diffuse acoustic simulation,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2020) http://dx.doi.org/10.1109/ICASSP40776.2020.9052932 \dodoi10.1109/icassp40776.2020.9052932.
[26] E. Bezzam, R. Scheibler, C. Cadoux, and T. Gisselbrecht, “A study on more realistic room simulation for far-field keyword spotting,” in 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE (2020), pp. 674–680.
[27] D. T. Murphy, “Digital waveguide mesh topologies in room acoustics modelling,” Ph.D. thesis, Univ. York, York, U.K., 2000.
[28] K. Kowalczyk, “Boundary and medium modelling using compact finite difference schemes in simulations of room acoustics for audio and architectural design applications,” Ph.D. thesis, Queen’s University Belfast, 2010.
[29] T. Sakuma, S. Sakamoto, and T. Otsuru, Computational simulation in architectural and environmental acoustics (Springer, 2014).
[30] B. Hamilton and S. Bilbao, “Fdtd methods for 3-d room acoustics simulation with high-order accuracy in space and time,” IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(11), 2112–2124 (2017) \dodoi10.1109/TASLP.2017.2744799.
[31] J. A. Hargreaves, L. R. Rendell, and Y. W. Lam, “A framework for auralization of boundary element method simulations including source and receiver directivity,” The Journal of the Acoustical Society of America 145(4), 2625–2637 (2019) https://doi.org/10.1121/1.5096171 \dodoi10.1121/1.5096171.
[32] T. Yoshida, T. Okuzono, and K. Sakagami, “A parallel dissipation-free and dispersion-optimized explicit time-domain fem for large-scale room acoustics simulation.,” Buildings 12(2), 105 (2022) https://doi.org/10.3390/buildings12020105.
[33] M. Aretz, Combined wave and ray based room acoustic simulations of small rooms, Vol. 12 (Logos Verlag Berlin GmbH, 2012).
[34] M. R. Thomas, “Wayverb: A graphical tool for hybrid room acoustics simulation,” Ph.D. thesis, University of Huddersfield, 2017.
[35] A. Ratnarajah, Z. Tang, and D. Manocha, “Ts-rir: Translated synthetic room impulse responses for speech augmentation,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2021), pp. 259–266, \dodoi10.1109/ASRU51503.2021.9688304.
[36] A. Luo, Y. Du, M. J. Tarr, J. B. Tenenbaum, A. Torralba, and C. Gan, “Learning neural acoustic fields,” in 36th Conference on Neural Information Processing Systems., NeurIPS (2022), https://openreview.net/pdf?id=D21DRzkZbSB.
[37] M. Aretz, P. Dietrich, and M. Vorländer, “Application of the mirror source method for low frequency sound prediction in rectangular rooms,” Acta Acustica united with Acustica 100(2), 306–319 (2014).
[38] M. R. Schroeder and K. H. Kuttruff, “On frequency response curves in rooms. comparison of experimental, theoretical, and monte carlo results for the average frequency spacing between maxima,” J. Acoust. Soc. Am. 34(1), 76–80 (1962) https://doi.org/10.1121/1.1909022.
[39] E. A. Habets, “Room impulse response generator,” Technische Universiteit Eindhoven, Tech. Rep 2(2.4), 1 (2006).
[40] R. Scheibler, E. Bezzam, and I. Dokmanic, “Pyroomacoustics: A python package for audio room simulation and array processing algorithms,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018) http://dx.doi.org/10.1109/ICASSP.2018.8461310 \dodoi10.1109/icassp.2018.8461310.
[41] A. Wabnitz, N. Epain, C. Jin, and A. Van Schaik, “Room acoustics simulation for multichannel microphone arrays,” in Proceedings of the International Symposium on Room Acoustics, Citeseer (2010), pp. 1–6.
[42] D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpurir: A python library for room impulse response simulation with gpu acceleration,” Multimedia Tools and Applications 80(4), 5653–5671 (2020) http://dx.doi.org/10.1007/s11042-020-09905-3 \dodoi10.1007/s11042-020-09905-3.
[43] Z.-h. Fu and J.-w. Li, “Gpu-based image method for room impulse response calculation,” Multimedia Tools and Applications 75(9), 5205–5221 (2016).
[44] M. Kompis and N. Dillier, “Simulating transfer functions in a reverberant room including source directivity and head-shadow effects,” J. Acoust. Soc. Am. 93(5), 2779–2787 (1993).
[45] T. Betlehem and M. Poletti, “Sound field of a directional source in a reverberant room,” Journal of the Acoustical Society of New Zealand 25(4), 12–22 (2012).
[46] F. Brinkmann, V. Erbes, and S. Weinzierl, “Extending the closed form image source model for source directivity,” in DAGA, München (2018).
[47] D. P. Jarrett, E. A. P. Habets, M. R. P. Thomas, and P. A. Naylor, “Rigid sphere room impulse response simulation: algorithm and applications,” J. Acoust. Soc. Am. 132(3), 1462—1472 (2012) https://doi.org/10.1121/1.4740497 \dodoi10.1121/1.4740497.
[48] S. Hafezi, A. H. Moore, and P. A. Naylor, “Modelling source directivity in room impulse response simulation for spherical microphone arrays,” in 2015 23rd European Signal Processing Conference (EUSIPCO), IEEE (2015), pp. 574–578.
[49] P. N. Samarasinghe, T. D. Abhayapala, Y. Lu, H. Chen, and G. Dickins, “Spherical harmonics based generalized image source method for simulating room acoustics,” J. Acoust. Soc. Am. 144(3), 1381–1391 (2018).
[50] Y. Luo and W. Kim, “Fast source-room-receiver acoustics modeling,” in 2020 28th European Signal Processing Conference (EUSIPCO), IEEE (2021), pp. 51–55.
[51] Z. Xu, A. Herzog, A. Lodermeyer, E. A. P. Habets, and A. G. Prinn, “Acoustic reciprocity in the spherical harmonic domain: A formulation for directional sources and receivers,” JASA Express Letters 2, 124801 (2022).
[52] H. Carslaw, “Some Multiform Solutions of the Partial Differential Equations of Physical Mathematics and theri Applications,” Proceedings of the London Mathematical Society s1-30(1), 121–165 (1899) https://academic.oup.com/plms/article/s1-30/1/121/2917781 \dodoi10.1112/plms/s1-30.1.121.
[53] H. Kuttruff, Room Acoustics (CRC Press, 2017).
[54] D. P. Jarrett and E. A. P. Habets, “On the noise reduction performance of a spherical harmonic domain tradeoff beamformer,” IEEE Signal Process. Lett. 19(11), 773–776 (2012).
[55] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, first ed. (Academic Press, London, 1999).
[56] W. Zhang, T. D. Abhayapala, R. A. Kennedy, and R. Duraiswami, “Insights into head-related transfer function: Spatial dimensionality and continuous representation,” J. Acoust. Soc. Am. 127(4), 2347–2357 (2010) https://doi.org/10.1121/1.3336399 \dodoi10.1121/1.3336399.
[57] P. N. Samarasinghe and T. D. Abhayapala, “Room transfer function measurement from a directional loudspeaker,” in 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC) (2016), pp. 1–5, \dodoi10.1109/IWAENC.2016.7602969.
[58] D. Ward and T. Abhayapala, “Reproduction of a plane-wave sound field using an array of loudspeakers,” IEEE Transactions on Speech and Audio Processing 9(6), 697–707 (2001) \dodoi10.1109/89.943347.
[59] B. Rafaely, Fundamentals of spherical array processing, Vol. 8 (Springer, 2015).
[60] J. Ahrens and S. Bilbao, “Computation of spherical harmonics based sound source directivity models from sparse measurement data,” in Forum Acusticum (2021), pp. 2019–2026, \dodoi10.48465/fa.2020.0042, 10.48465/fa.2020.0042.
[61] J. G. Tylka, R. Sridhar, and E. Choueiri, “A database of loudspeaker polar radiation measurements,” Journal of the Audio Engineering Society (2015).
[62] A. R. Edmonds, Angular Momentum in Quantum Mechanics (Princeton University Press, 2016), https://doi.org/10.1515/9781400884186.
[63] J.-P. Berenger, “A perfectly matched layer for the absorption of electromagnetic waves,” J. Comput. Phys. 114(2), 185–200 (1994) https://www.sciencedirect.com/science/article/pii/S0021999184711594 \dodoihttps://doi.org/10.1006/jcph.1994.1159.
[64] P. M. Morse and R. H. Bolt, “Sound waves in rooms,” Rev. Mod. Phys. 16(2), 69 (1944).
[65] A. Gray and J. Markel, “Distance measures for speech processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing 24(5), 380–391 (1976) \dodoi10.1109/TASSP.1976.1162849.
[66] J. Paasonen, A. Karapetyan, J. Plogsties, and V. Pulkki, “Proximity of surfaces — acoustic and perceptual effects,” J. Audio Eng. Soc. 65(12), 997–1004 (2017) \dodoihttps://doi.org/10.17743/jaes.2017.0039.