A Two-Step Approach for Narrowband Source Localization in Reverberant Rooms

Abstract

This paper presents a two-step approach for narrowband source localization within reverberant rooms. The first step involves dereverberation by modeling the homogeneous component of the sound field by an equivalent decomposition of planewaves using Iteratively Reweighted Least Squares (IRLS), while the second step focuses on source localization by modeling the dereverberated component as a sparse representation of point-source distribution using Orthogonal Matching Pursuit (OMP). The proposed method enhances localization accuracy with fewer measurements, particularly in environments with strong reverberation. A numerical simulation in a conference room scenario, using a uniform microphone array affixed to the wall, demonstrates real-world feasibility. Notably, the proposed method and microphone placement effectively localize sound sources within the 2D-horizontal plane without requiring prior knowledge of boundary conditions and room geometry, making it versatile for application in different room types.

Index Terms—  Source localization, reverberant environments, sparse representation, sound field decomposition, dereverberation

1 Introduction

Sound source localization plays a crucial role in various acoustic applications, such as speech enhancement [enhancement1, enhancement2], source separation [fahim, s-sh], and sound field translation [translation]. In environments with strong reverberation, source localization becomes considerably more challenging. Several source localization methods have been applied to reverberant environments, such as beamforming [beamforming], MUSIC [music_l10n_2, music_l10n], SRP-PHAT [srp-phat], and CLEAN [clean]. These methods use statistical properties of signals to estimate source positions. However, their performance declines significantly when processing narrowband sources and correlated reflections.

Recently, sparsity-based methods, such as LASSO [lasso], Orthogonal Matching Pursuit (OMP) [omp_l10n], and sparse Bayesian learning [sbl], have been introduced to overcome the limitations of narrowband localization by assuming a sparse distribution of sources in the spatial domain. Nevertheless, these methods continue to struggle in strong reverberation due to the interference of room reflections. Overcoming this challenge often requires either prior knowledge of room geometry and boundary conditions [prior] or the use of sound field decomposition [decompose]. Sound field decomposition entails separating the sound field into a direct component and a reverberant component and modeling room reflections as a sum of planewaves or spherical harmonics [vekua2]. For source localization tasks, the latter approach models room reflections as a sum of planewaves and estimates the source positions from the dereverberated sound field; examples include wavefield separation projector processing (WSPP) [wspp] and the sparsity-based spherical harmonics model (S-SH) [s-sh]. These approaches can handle a wide range of scenarios without requiring prior information. However, they demand a large number of microphones for modeling the reverberant component and are restricted by the geometry of the microphone array.

In this paper, we propose a similar two-step approach of dereverberation and sparsity-based source localization that uses fewer microphones. We first dereverberate the sound field captured in a room by modeling the reverberant component with an equivalent planewave decomposition model [vekua2]. The second step models the dereverberated sound field as a superposition of sparse point-sources and determines the source positions based on the sparse equivalent source method [fernandez2017sparse]. We verify the proposed method through simulations in a conference room scenario, using a linear microphone array around the middle horizontal plane of the wall. The proposed placement effectively captures sound field information, while also not being constrained by room geometry. The results demonstrate that the two-step approach using separate algorithms improves source localization with fewer microphones, especially in rooms with strong reverberation.

2 Problem Formulation

Fig. 1: Framework for the proposed method.

Consider $N$ sound sources in a reverberant room, along with $M$ microphones uniformly spaced and affixed to the walls, as illustrated in Fig. 1. The positions of the sound sources and microphones are defined as $\boldsymbol{y}_{n}\equiv(x_{n},y_{n},z_{n})$ for $n=1,2,\dots,N$ and $\boldsymbol{x}_{m}\equiv(x_{m},y_{m},z_{m})$ for $m=1,2,\dots,M$, respectively, with respect to the coordinate origin $O$ at the front-left-bottom corner of the room. Note that this configuration places the $N$ sources inside the microphone array.

The sound pressure received by the $m^{\text{th}}$ microphone is:

$$s(k,\boldsymbol{x}_{m})=\sum_{n=1}^{N}G(k,\boldsymbol{x}_{m},\boldsymbol{y}_{n})\alpha_{n}(k) \tag{1}$$

where $k=2\pi f/c$ is the wave number, $f$ is the frequency, $c$ is the speed of sound, $s(k,\boldsymbol{x}_{m})$ represents the pressure, $G(k,\boldsymbol{x}_{m},\boldsymbol{y}_{n})$ denotes the room transfer function between the $n^{\text{th}}$ source and the $m^{\text{th}}$ microphone, and $\alpha_{n}(k)$ denotes the signal produced by the $n^{\text{th}}$ source. Note that this formulation omits additive noise for brevity; noise is considered in the simulation.

Based on sound field decomposition [decompose] and Vekua’s theory [vekua2, vekua], any reverberant sound field can be partitioned into a sum of its particular and homogeneous solutions, which are equivalent to the direct and reverberant components, respectively. Within a bounded convex region, the reverberant sound field component can be well approximated using a finite number of planewave functions distributed over a spherical region. Consequently, the sound field can be expressed as a linear combination of $N$ direct-path point-source Green’s functions and $L$ planewave Green’s functions. Equation (1) can thus be decomposed as follows:

$$s(k,\boldsymbol{x}_{m})\approx\underbrace{\sum_{n=1}^{N}G_{0}(k,\boldsymbol{x}_{m},\boldsymbol{y}_{n})\alpha_{n}(k)}_{\text{Direct}}+\underbrace{\sum_{\ell=1}^{L}W(k,\boldsymbol{x}_{m},\hat{\boldsymbol{z}}_{\ell})\beta_{\ell}(k)}_{\text{Reverberant}} \tag{2}$$

where $G_{0}(k,\boldsymbol{x}_{m},\boldsymbol{y}_{n})=e^{ik\|\boldsymbol{x}_{m}-\boldsymbol{y}_{n}\|}/(4\pi\|\boldsymbol{x}_{m}-\boldsymbol{y}_{n}\|)$ represents the direct-path Green’s function between the $n^{\text{th}}$ source and the $m^{\text{th}}$ microphone in a free-field environment, and $W(k,\boldsymbol{x}_{m},\hat{\boldsymbol{z}}_{\ell})=e^{-ik\hat{\boldsymbol{z}}_{\ell}\cdot\boldsymbol{x}_{m}}$ represents the $\ell^{\text{th}}$ planewave Green’s function at the $m^{\text{th}}$ microphone, with $\hat{\boldsymbol{z}}_{\ell}$ denoting the incident direction of the $\ell^{\text{th}}$ planewave for $\ell=1,2,\dots,L$. The coefficient $\beta_{\ell}(k)$ represents the weight of the $\ell^{\text{th}}$ planewave.

In order to find the source positions, the direct component is modeled using a dictionary of $J$ candidate point-sources on a grid within the room, based on the sparse equivalent source method [fernandez2017sparse]. This method assumes that the sound sources are sparse in the spatial domain, i.e., $N\ll J$. Equation (2) can then be reformulated as follows:

$$s(k,\boldsymbol{x}_{m})\approx\sum_{j=1}^{J}G_{0}(k,\boldsymbol{x}_{m},\boldsymbol{y}_{j})\alpha_{j}+\sum_{\ell=1}^{L}W(k,\boldsymbol{x}_{m},\hat{\boldsymbol{z}}_{\ell})\beta_{\ell} \tag{3}$$

where $G_{0}(k,\boldsymbol{x}_{m},\boldsymbol{y}_{j})$ denotes the free-field point-source Green’s function between the $j^{\text{th}}$ grid point and the $m^{\text{th}}$ microphone, and the dependence of the weights $\alpha_{j}$ and $\beta_{\ell}$ on $k$ is omitted for brevity. Equation (3) is represented in matrix form as:

$$\mathbf{s}=\mathbf{G}_{0}\boldsymbol{\alpha}+\mathbf{W}\boldsymbol{\beta}, \tag{4}$$

where $\mathbf{s}\in\mathbb{C}^{M}$ denotes the measured pressure from the $M$ microphones, and $\mathbf{G}_{0}\in\mathbb{C}^{M\times J}$ and $\mathbf{W}\in\mathbb{C}^{M\times L}$ represent the dictionary matrices for point-sources and planewaves, respectively. The weight coefficient vectors for point-sources and planewaves are denoted as $\boldsymbol{\alpha}\in\mathbb{C}^{J}$ and $\boldsymbol{\beta}\in\mathbb{C}^{L}$.
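As an illustration of how the dictionaries in (4) could be assembled, the following NumPy sketch builds $\mathbf{G}_{0}$ and $\mathbf{W}$ from the definitions of $G_{0}$ and $W$ given above. The function names, the toy geometry (ten microphones along one wall, a 5×5 grid, 16 horizontal planewave directions), and the source and planewave weights are hypothetical choices made for this example and are not the configuration used in the paper.

```python
import numpy as np

def point_source_dictionary(k, mics, grid):
    """G0 (M x J): free-field Green's function between each grid point and microphone."""
    # Distance between every microphone (rows) and every grid point (columns).
    dist = np.linalg.norm(mics[:, None, :] - grid[None, :, :], axis=-1)
    return np.exp(1j * k * dist) / (4.0 * np.pi * dist)

def planewave_dictionary(k, mics, directions):
    """W (M x L): planewave exp(-i k z_hat . x_m) for each unit direction z_hat."""
    return np.exp(-1j * k * (mics @ directions.T))

# Hypothetical toy geometry (an assumption, not the paper's 106-microphone layout).
k = 2.0 * np.pi * 1000.0 / 340.0      # wave number at 1 kHz with c = 340 m/s
z0 = 1.6                               # common height of microphones and grid
mics = np.column_stack([np.linspace(0.2, 3.8, 10), np.zeros(10), np.full(10, z0)])
gx, gy = np.meshgrid(np.linspace(0.5, 3.5, 5), np.linspace(0.5, 5.5, 5))
grid = np.column_stack([gx.ravel(), gy.ravel(), np.full(gx.size, z0)])
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles), np.zeros_like(angles)])

G0 = point_source_dictionary(k, mics, grid)      # shape (M, J)
W = planewave_dictionary(k, mics, directions)    # shape (M, L)

# Synthesize s = G0 @ alpha + W @ beta as in (4), with two active grid points.
alpha = np.zeros(grid.shape[0], dtype=complex)
alpha[[6, 18]] = [1.0, 0.7]
beta = 0.05 * np.ones(directions.shape[0], dtype=complex)
s = G0 @ alpha + W @ beta
```

Because the microphones sit on the wall and the grid points lie inside the room, no microphone-to-grid distance is zero, so the division in `point_source_dictionary` is safe for this layout.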

Equation (4) is the sound field decomposition model in reverberant environments. As $L$ becomes sufficiently large, most elements of $\boldsymbol{\beta}$ can be approximated as nearly zero. Additionally, under the spatial sparsity assumption for the sound source distribution, the vector $\boldsymbol{\alpha}$ also has few non-zero elements, given that $N$ is significantly smaller than $J$. Therefore, the weight coefficients for point-sources $\boldsymbol{\alpha}$ and planewaves $\boldsymbol{\beta}$ can be determined through sparse optimization, as shown below [s-sh]:

$$\underset{\boldsymbol{\alpha},\boldsymbol{\beta}}{\operatorname{argmin}}\;\|\boldsymbol{\alpha}\|_{1}+\lambda\|\boldsymbol{\beta}\|_{1}\quad\text{s.t. }\mathbf{s}=\mathbf{G}_{0}\boldsymbol{\alpha}+\mathbf{W}\boldsymbol{\beta}, \tag{5}$$

where $\lambda$ is a regularization parameter.

The objective in this study is to estimate $\boldsymbol{\alpha}$ and determine the corresponding $N$ source positions $\boldsymbol{y}_{n}$ by solving (4), while assuming that the number of sources $N$ is known. In the following section, we propose to solve (4) in a two-step process.

3 Two-Step Sparse Localization Method

3.1 Dereverberation

In the first step, we estimate the reverberant component and perform dereverberation. We start by rearranging (4) as:

$$\mathbf{s}=\boldsymbol{A}\boldsymbol{\gamma} \tag{6}$$

where $\boldsymbol{A}=[\mathbf{G}_{0},\mathbf{W}]\in\mathbb{C}^{M\times(J+L)}$ and $\boldsymbol{\gamma}=[\boldsymbol{\alpha}^{T},\boldsymbol{\beta}^{T}]^{T}\in\mathbb{C}^{J+L}$. In this context, we assume $M<(J+L)$, i.e., the number of measurements is smaller than the combined number of modeled point-sources (grid points) and planewaves in the dictionary. Hence, the estimation of $\boldsymbol{\gamma}$ is an underdetermined problem.

We consider solving the linear regression problem (6) through Iteratively Reweighted Least Squares (IRLS) [irls], which exploits an $\ell^{p}$-norm approach (where $0<p\leq 2$) by adding weights to an $\ell^{2}$-norm optimization that iteratively refine the sparsity of the solution:

$$\min_{\boldsymbol{\gamma}}\sum_{i=1}^{J+L}w_{i}\gamma_{i}^{2},\quad\text{subject to }\mathbf{s}=\boldsymbol{A}\boldsymbol{\gamma} \tag{7}$$

where $w_{i}=|\gamma_{i}^{(v-1)}|^{p-2}$ are the weights computed from the previous iterate $\boldsymbol{\gamma}^{(v-1)}$. With these weights, each term of the sum has magnitude $|\gamma_{i}|^{p}$ when evaluated at the previous iterate, so the iteration effectively targets an $\ell^{p}$ objective function. The next iterate $\boldsymbol{\gamma}^{(v)}$ is given by:

$$\boldsymbol{\gamma}^{(v)}=Q_{v}\boldsymbol{A}^{T}(\boldsymbol{A}Q_{v}\boldsymbol{A}^{T})^{-1}\mathbf{s} \tag{8}$$

where $Q_{v}$ is the diagonal matrix with entries $1/w_{i}=|\gamma_{i}^{(v-1)}|^{2-p}$. We obtain the reverberant component of the sound field from the estimated planewave weight coefficients $\hat{\boldsymbol{\beta}}$, which are extracted from $\hat{\boldsymbol{\gamma}}$. Therefore, the dereverberation process can be represented as:

$$\hat{\mathbf{s}}_{0}=\mathbf{s}-\mathbf{W}\hat{\boldsymbol{\beta}} \tag{9}$$

where $\hat{\mathbf{s}}_{0}$ denotes the estimated direct component, which we use as the dereverberated sound field component in the subsequent source localization step.
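The dereverberation step can be sketched in NumPy as follows. This is a minimal illustration of the update in (8) followed by the subtraction in (9), under several assumptions that are not stated in the paper: the conjugate transpose is used in place of the plain transpose because the quantities are complex, a small constant `eps` smooths the weights so that zero entries do not stall the iteration, the iteration starts from the minimum-norm solution, and the dictionaries are random stand-ins rather than physical Green's functions. The function name `irls` and all parameter values are hypothetical.

```python
import numpy as np

def irls(A, s, p=1.0, n_iter=30, eps=1e-8):
    """Sparse solution of s = A @ gamma using the reweighted update in (8)."""
    AH = A.conj().T                              # conjugate transpose (assumption)
    gamma = AH @ np.linalg.solve(A @ AH, s)      # minimum-norm start, i.e. Q = I
    for _ in range(n_iter):
        # Diagonal of Q: |gamma_i|^(2-p), smoothed by eps (assumption).
        q = (np.abs(gamma) ** 2 + eps) ** (1.0 - p / 2.0)
        AQ = A * q                               # equivalent to A @ diag(q)
        gamma = q * (AH @ np.linalg.solve(AQ @ AH, s))
    return gamma

# Hypothetical toy problem with random dictionaries (not a physical model).
rng = np.random.default_rng(0)
M, J, L = 20, 30, 40
G0 = rng.standard_normal((M, J)) + 1j * rng.standard_normal((M, J))
W = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
alpha = np.zeros(J, dtype=complex)
alpha[[3, 17]] = [1.0, 0.8j]
beta = np.zeros(L, dtype=complex)
beta[[5, 22, 31]] = [0.5, -0.3, 0.2j]
s = G0 @ alpha + W @ beta

A = np.hstack([G0, W])            # A = [G0, W] as in (6)
gamma_hat = irls(A, s)            # estimate of gamma = [alpha; beta]
beta_hat = gamma_hat[J:]          # planewave weights
s0_hat = s - W @ beta_hat         # dereverberated field, as in (9)
```

Each update satisfies the constraint $\mathbf{s}=\boldsymbol{A}\boldsymbol{\gamma}$ up to numerical precision, so the iterations only redistribute energy among the dictionary columns.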

3.2 Source Localization

For the second step, we re-estimate $\boldsymbol{\alpha}$ from the dereverberated component using OMP. Given the spatial sparsity assumption for the sound source distribution, $\boldsymbol{\alpha}$ is assumed to have $N$ non-zero elements. OMP is a greedy algorithm that adopts an iterative process to select the most relevant sound source at each step [omp]. By forcing inactive weights to zero, OMP improves the accuracy of source localization. Utilizing the result from (9), we express the direct sound field component as:

$$\hat{\mathbf{s}}_{0}=\mathbf{G}_{0}\widetilde{\boldsymbol{\alpha}} \tag{10}$$

where $\widetilde{\boldsymbol{\alpha}}$ denotes the estimated point-source weight vector determined by OMP. In practice, source positions are found by selecting, in each iteration, the grid point whose dictionary column has the highest correlation with the current residual. The OMP algorithm is as follows:

Algorithm 1 Dereverberated OMP localization

Input: measurements $\hat{\mathbf{s}}_{0}$, point-source dictionary $\mathbf{G}_{0}$, number of sources $N$
Output: estimated weight coefficients $\widetilde{\boldsymbol{\alpha}}$, estimated source positions $\widetilde{\boldsymbol{y}}_{n}$

$\Lambda_{0}=\emptyset$, $\Psi_{0}=\emptyset$, $\mathbf{r}\leftarrow\hat{\mathbf{s}}_{0}$, $\boldsymbol{g}_{i}=\mathbf{G}_{0}[:,i]$ for $i\in\{1,2,\ldots,J\}$
for $n=1$ to $N$ do
     $j_{n}\leftarrow\arg\max_{j\in\{1,2,\ldots,J\}}|\langle\mathbf{r},\boldsymbol{g}_{j}\rangle|$, $\widetilde{\boldsymbol{y}}_{n}\leftarrow$ position of grid point $j_{n}$
     $\Psi_{n}\leftarrow\Psi_{n-1}\cup\{\widetilde{\boldsymbol{y}}_{n}\}$
     $\Lambda_{n}\leftarrow\Lambda_{n-1}\cup\{\boldsymbol{g}_{j_{n}}\}$
     $\mathbf{r}\leftarrow\hat{\mathbf{s}}_{0}-\Lambda_{n}\Lambda_{n}^{\dagger}\hat{\mathbf{s}}_{0}$
end for
$\widetilde{\boldsymbol{\alpha}}_{\Psi_{N}}\leftarrow\Lambda_{N}^{\dagger}\hat{\mathbf{s}}_{0}$
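The greedy selection in Algorithm 1 can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the function name `omp_localize` is hypothetical, already selected columns are masked so that they cannot be picked twice, the correlation is left unnormalized as in the algorithm listing, and the toy dictionary has orthonormal columns (obtained by a QR factorization of a random matrix) so that the outcome is easy to verify by hand. A physical dictionary built from Green's functions is overcomplete and correlated, so recovery there is not guaranteed.

```python
import numpy as np

def omp_localize(s0, G0, grid, n_sources):
    """Greedy OMP: returns the sparse weight vector and the selected grid positions."""
    residual = s0.copy()
    support = []
    for _ in range(n_sources):
        # Correlation of every dictionary column with the current residual.
        corr = np.abs(G0.conj().T @ residual)
        if support:
            corr[support] = 0.0                  # never pick the same column twice
        support.append(int(np.argmax(corr)))
        atoms = G0[:, support]
        # Least-squares fit on the selected columns (pseudoinverse applied to s0).
        coeffs, *_ = np.linalg.lstsq(atoms, s0, rcond=None)
        residual = s0 - atoms @ coeffs
    alpha = np.zeros(G0.shape[1], dtype=complex)
    alpha[support] = coeffs
    return alpha, grid[support]

# Hypothetical toy example: dictionary with orthonormal columns.
rng = np.random.default_rng(1)
X = rng.standard_normal((16, 8)) + 1j * rng.standard_normal((16, 8))
G0, _ = np.linalg.qr(X)                          # 16 x 8, orthonormal columns
grid = np.column_stack([np.arange(8) * 0.5, np.full(8, 1.0)])  # made-up (x, y) positions
alpha_true = np.zeros(8, dtype=complex)
alpha_true[[2, 5]] = [1.0, 0.6j]
s0 = G0 @ alpha_true

alpha_hat, positions = omp_localize(s0, G0, grid, n_sources=2)
```

In this toy case the correlations equal the magnitudes of the true weights, so the stronger source at grid index 2 is selected first and the weaker one at index 5 second.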

4 Simulation Results

Fig. 2: Point-source weights at $f=1000$ Hz in the $xy$-plane for (a) ground truth, (b,f) ND-OMP, (c,g) D-IRLS, (d,h) WSPP, and (e,i) proposed, with (a) to (e) for $T_{60}=0.75$ s and (f) to (i) for $T_{60}=1.5$ s.

Fig. 3: Average localization errors for the four source locations at $f=1000$ Hz, averaged over 100 Monte Carlo tests.

In this section, we evaluate the proposed method in a simulated reverberant room. To emulate practical scenarios, we focus on a conference room environment with multiple participants conversing. We assume that all sound sources are positioned around the conference table at a height of approximately 1.6 m above the floor. Therefore, we evaluate the performance of the proposed method for localization in the 2D horizontal plane, as this minimizes the number of microphones required for a practical implementation.

For comparison, we evaluate three other methods: OMP without dereverberation (ND-OMP), WSPP [wspp], and simultaneous dereverberation and localization by IRLS (D-IRLS). Specifically, ND-OMP estimates the source positions directly using OMP without the dereverberation step. WSPP is a method based on planewave decomposition using OMP, which eliminates the ambient interference with a linear projection operator. D-IRLS estimates both the direct and reverberant components simultaneously through IRLS, thereby determining $\boldsymbol{\alpha}$ by solving (6). Here, we select $L=70$ planewaves for WSPP and $L=3000$ for D-IRLS.

We use the RIR generator toolbox [rirgenerator] to simulate a $4.1\times 6.2\times 3.9$ m reverberant room. We note that while we model a rectangular shoebox room here, our proposed localization method does not rely on a known room geometry. We position $N=4$ sources within the height range of $1.55\leq z\leq 1.65$ m. Then, we generate the microphone signals by convolving the image source RIRs with clean speech signals. We select two female and two male speech sources from the MS-SNSD dataset [ms-snsd]. The sampling frequency is 16 kHz. The measurement SNR is 30 dB. The STFT uses a frame length of 16384 samples with 50% overlap. We select the frequency bin at 1 kHz, for which $k=18.48$ with the speed of sound $c=340$ m/s. Matching the room geometry, we use a uniform rectangular microphone array with $M=106$ microphones, spaced 0.2 m apart, affixed to the perimeter of the walls at a height of $z=1.60$ m. For the reverberant sound field, we select $L=3000$ planewaves, and for the direct sound field, $J=1600$ point-sources (grid points) on the $z=1.60$ m plane, following the 2D localization in a conference room scenario. We assume that the reflections from the ceiling and floor are inactive. This implies that we are simulating a 2D sound field using [rirgenerator].
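The quoted wave number can be checked directly from $k=2\pi f/c$, together with the STFT bin that corresponds to 1 kHz. The short sketch below assumes that the FFT length equals the stated frame length of 16384 samples, which the text does not state explicitly; the variable names are chosen for this example.

```python
import numpy as np

fs = 16000.0      # sampling frequency in Hz
n_fft = 16384     # FFT length, assumed equal to the STFT frame length
f = 1000.0        # analysis frequency in Hz
c = 340.0         # speed of sound in m/s

k = 2.0 * np.pi * f / c                   # wave number in rad/m
bin_index = int(round(f * n_fft / fs))    # STFT bin closest to f
bin_freq = bin_index * fs / n_fft         # center frequency of that bin in Hz

print(f"k = {k:.2f} rad/m, bin {bin_index} at {bin_freq:.1f} Hz")
```

Under this assumption the bin spacing is $16000/16384\approx 0.98$ Hz, and 1 kHz falls exactly on bin 1024.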

We first evaluate two specific cases, as shown in Fig. 2. We fix the four source positions as detailed in Fig. 2(a). We select two sets of wall reflection coefficients to compare the performance at different reverberation times ($T_{60}$): $[0.9,0.93,0.94,0.94,0,0]$ and $[0.99,0.98,0.98,0.99,0,0]$, with the reflection order of the image source model set to 30, equivalent to $T_{60}$ values of 0.75 s and 1.5 s, respectively.

In Fig. 2, we present the estimated weight vector of point-sources $\widetilde{\boldsymbol{\alpha}}$ compared to the true source weights for rooms with high and extreme reverberation. Fig. 2(a) shows the ground truth. Starting with ND-OMP in (b), we see that this method can successfully localize the sources without dereverberation at $T_{60}=0.75$ s. However, the performance in (f) degrades at $T_{60}=1.5$ s owing to interference from room reflections. Although D-IRLS in (c) and (g) cancels interference from room reflections well, it does not reliably yield a sparse solution when estimating $\widetilde{\boldsymbol{\alpha}}$. In (d) and (h), the WSPP method, which is typically effective with a grid-point microphone array [omp_l10n], struggles in our setup due to the constrained microphone placement.

The proposed method, shown in (e) and (i), achieves the best performance. The two-step approach offers robust localization by combining the advantages of IRLS and OMP. IRLS excels at solving underdetermined systems, so it is not necessary to comply with $L\simeq 2\lceil kr\rceil+1$, where $r$ is the radius of the measurement area, for optimal dereverberation [vekua2]. This enhances dereverberation performance and prevents overfitting of the reverberant component, allowing flexibility in the number of planewaves used for different frequency bins. OMP then provides a strictly sparse solution, enhancing source localization and enabling source loudness estimation.

In the second evaluation, we compare the average localization errors (LE) across 100 Monte Carlo test samples for different reverberation scenarios. We randomly place the four sources inside the conference room within the height range $1.55\leq z\leq 1.65$ m. The $T_{60}$ is varied from 0.5 to 1.5 s. We define the LE between the estimated and true source positions to evaluate the performance of the source localization methods as follows:

$$\mathbf{LE}=\frac{1}{N}\sum_{n=1}^{N}\|\widetilde{\boldsymbol{y}}_{n}-\boldsymbol{y}_{n}\|_{2} \tag{11}$$
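The metric in (11) can be computed as in the sketch below. Since the text does not specify how estimated positions are paired with true positions, this example assumes the pairing that minimizes the total distance, found by exhaustive search over permutations (24 candidates for $N=4$). That pairing rule, the function name `localization_error`, and the example coordinates are assumptions made for illustration.

```python
import itertools
import numpy as np

def localization_error(est, true):
    """Mean Euclidean distance between paired estimated and true positions, as in (11).

    Pairing is the permutation with the smallest total distance (an assumption).
    """
    # dists[i, j] = distance between estimate i and true source j.
    dists = np.linalg.norm(est[:, None, :] - true[None, :, :], axis=-1)
    n = len(true)
    best_total = min(
        sum(dists[i, perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best_total / n

# Made-up example: two sources, estimates listed in swapped order.
true_pos = np.array([[0.0, 0.0], [1.0, 0.0]])
est_pos = np.array([[1.0, 0.3], [0.0, 0.4]])
le = localization_error(est_pos, true_pos)   # (0.3 + 0.4) / 2 = 0.35 m
```

Pairing by minimum total distance keeps the metric independent of the order in which OMP happens to select the sources.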

The results in Fig. 3 show that the proposed method provides robust source position estimation. As $T_{60}$ increases, the method maintains stable LE values, while the performance of the other methods degrades. However, we note that the average LE of the proposed method at $T_{60}=0.5$ s remains relatively high, primarily due to the magnitude differences among sources. Specifically, if the magnitude of one source is much lower than that of the other sources, localizing this particular source becomes challenging because sparse recovery methods might regard it as noise. The presented results illustrate the robust performance achieved in 2D horizontal-plane localization. However, it is worth noting that the proposed method can be extended to 3D scenarios with 3D point-source and planewave dictionaries and a 3D microphone array.

5 Conclusion

In this paper, we have introduced an enhanced method for narrowband source localization in reverberant environments. Based on sparse representation and planewave dereverberation, the two-step approach improves the localization accuracy by estimating the direct and reverberant components separately. The simulation results show that the proposed method can effectively localize multiple sources with a reduced number of measurements, particularly in scenarios with high $T_{60}$. Moreover, our method overcomes the overfitting problem in planewave decomposition, allowing for a more flexible and straightforward choice of the number of planewaves. Future work includes resolving the magnitude issue in sparse recovery algorithms and considering wideband scenarios to further reduce the number of microphones.