This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Subband Splitting: Simple, Efficient and Effective Technique for Solving Block Permutation Problem in Determined Blind Source Separation

Kazuki Matsumoto1, and Kohei Yatabe2, k_m_w_314@akane.waseda.jpyatabe@go.tuat.ac.jp1Department of Communications and Computer Engineering1Department of Communications and Computer Engineering Waseda University Waseda University
334411 Ohkubo
334411 Ohkubo Shinjuku-ku Shinjuku-ku Tokyo Tokyo16916985558555 Japan
2Department of Electrical Engineering and Computer Science16916985558555 Japan
2Department of Electrical Engineering and Computer Science Tokyo University of Agriculture and Technology Tokyo University of Agriculture and Technology2224241616 Naka-cho2224241616 Naka-cho Koganei-shi Koganei-shi Tokyo Tokyo18418485888588 Japan
18418485888588 Japan
Abstract

Solving the permutation problem is essential for determined blind source separation (BSS). Existing methods, such as independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA), tackle the permutation problem by modeling the co-occurrence of the frequency components of source signals. One of the remaining challenges in these methods is the block permutation problem, which may cause severe performance degradation. In this paper, we propose a simple and effective technique for solving the block permutation problem. The proposed technique splits the entire frequency bands into several overlapping subbands and sequentially applies BSS methods (e.g., IVA, ILRMA, or any other method) to each subband. Since the splitting reduces the size of the problem, the BSS methods can effectively work in each subband. Then, the permutations among the subbands are aligned by using the separation result in one subband as the initial values for the other subbands. Additionally, we propose SS-IVA and SS-ILRMA by combining subband splitting (SS) with IVA and ILRMA. Experimental results demonstrated that our technique remarkably improves the separation performance without increasing computational cost. In particular, our SS-ILRMA achieved the separation performance comparable to the oracle method (frequency-domain independent component analysis with the ideal permutation solver). Moreover, SS-ILRMA converged faster than conventional IVA and ILRMA.

Multichannel source separation, independent vector analysis (IVA), independent low-rank matrix analysis (ILRMA), optimization, band splitting.

1 Introduction

Determined blind source separation (BSS) is a technique for separating the source signals from multichannel observed signals. It is usually formulated as an optimization problem of the demixing matrices in the time-frequency domain. To separate the signals with this formulation, addressing the permutation problem is essential. Namely, the order of extraction target must be kept consistent across all frequencies[1, 2]. While a permutation solver[3, 4, 5, 6, 7, 8, 9] is required in frequency-domain independent component analysis (FDICA)[1, 2], later methods, e.g., independent vector analysis (IVA)[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22], independent low-rank matrix analysis (ILRMA)[23, 24, 25, 26, 27, 28, 29, 30], and those assisted by deep learning [31, 32, 33, 34, 35, 36, 37, 38], model the co-occurrence of the frequency components and align the permutation within the optimization algorithms.

Refer to caption
Figure 1: Illustration of the proposed technique named subband splitting. The observed signal 𝐱\mathbf{x} is split into overlapping subbands (𝐱i)i=1I(\mathbf{x}_{\mathcal{F}_{i}})_{i=1}^{I}, and a BSS method (e.g., IVA, ILRMA, or any other method) sequentially separates each subband i{\mathcal{F}_{i}} by using the demixing matrices 𝐖i\mathbf{W}_{\!\mathcal{F}_{i}}. The separation results in the iith subband i\mathcal{F}_{i}, including the auxiliary variable 𝚯i\boldsymbol{\Theta}_{\!\mathcal{F}_{i}}, is used as the initial values in the next subband i+1\mathcal{F}_{i+1}, which aligns the permutation among the subbands.

Although these methods can align the permutation to some extent, they sometimes fail due to block permutation problem[16, 28, 29, 8, 36]. That is, several frequency blocks with inconsistent permutations may arise and cause severe performance degradation. To deal with the block permutation problem, the following three types of approaches have been employed in many existing methods. The first one utilizes an external permutation solver tailored for block permutation problem[29, 8]. The second one incorporates spatial information (e.g., the directions of the sources) into BSS algorithms[28, 21, 22]. The last approach is to improve the source models. In particular, aiming to precisely model the frequency-band-wise structures of the sources, several methods incorporated subband structures into the source model of IVA and ILRMA[16, 17, 18, 19, 20, 30]. These three approaches have mitigated the block permutation problem. However, some of these methods may require additional computational costs for external solvers (the first approach), while others may require efforts to develop optimization algorithms, especially when incorporating more sophisticated source models (the second and third approaches).

In this paper, we propose a simple technique named subband splitting (SS) to enhance the separation performance of existing BSS algorithms111 The same proposal has already been uploaded on arXiv[39] . As illustrated in Fig. 1, our technique splits the entire frequency bands into several overlapping subbands and then sequentially applies BSS algorithms to each subband. Owing to the splitting, the size of the optimization problem in each subband is reduced, and therefore source separation can be more easily done. Then, the separation results from one subband are used to initialize BSS algorithms in the subsequent subbands, which aligns the permutations across all subbands. Notably, our technique does not require any modification to the BSS algorithms nor additional computational cost. Therefore, it can be directly combined with various BSS algorithms and can improve their separation performance without paying additional costs.

We additionally propose SS-IVA and SS-ILRMA, in which IVA and ILRMA are sequentially applied to each subband. They demonstrate the usefulness of our technique. The experimental results showed that our technique improved the separation performance of both IVA and ILRMA. The robustness of ILRMA was especially improved, achieving a separation performance comparable to the oracle method (FDICA with the ideal permutation solver (IPS)). Moreover, the proposed SS-IVA and SS-ILRMA empirically required fewer iterations and shorter runtime to converge, compared to the conventional IVA and ILRMA.

The rest of the paper is organized as follows. Section 2 outlines determined BSS and the block permutation problem. In Section 3, we propose the subband splitting technique for arbitrary BSS algorithms. In Section 4, we propose SS-IVA and SS-ILRMA and experimentally investigate their separation performance and runtime. Section 5 concludes the paper.

Refer to caption

(a) Observed

Refer to caption

(b) IVA (SDRi = 1.29 dB)

Refer to caption

(c) Our SS-IVA (SDRi = 14.44 dB)

Figure 2: Example of separation results and their evaluation metric (SDRi) in a two-channel and two-source situation (M=N=2M=N=2). The observed signal (a) is generated from dev1_male4_src_1.wav and dev1_male4_src_2.wav in SiSEC 2011 dataset[40], by convolving them with room impulse responses recorded in a real room [41]. The sources are placed at 75-75^{\circ} and 6060^{\circ}, respectively, and the other experimental conditions are the same as those described in Section 4.3 The green and red colors represent the two different sources, where the ratio of their energy was calculated using the oracle sources. For visibility, the frequency axis is trimmed from 0 to 3 kHz. For the separated signals (b) and (c), the colors of the left bars indicate the dominant source in each frequency band, where the letter G and R corresponds to green and red, respectively. While the conventional IVA in (b) resulted in poor SDRi due to the block permutation problem, our proposed SS-IVA was able to successfully align the permutations across all frequencies, even though the BSS algorithm used in (b) and (c) was the same (i.e., AuxIVA[12]).

2 Preliminaries

2.1 Determined BSS

Determined BSS can be formulated as an optimization problem of the demixing matrices in the time-frequency domain. Let 𝐬ft=[sft1,,sftN]𝖳N\mathbf{s}_{ft}=[s_{ft1},\ldots,s_{ftN}]^{\mathsf{T}}\in\mathbb{C}^{N} be a vector of NN source signals at the (f,t)(f,t)th bin, where 1fF1\leq f\leq F and 1tT1\leq t\leq T are frequency and time indices, respectively. The MM-channel observed signal 𝐱ft=[xft1,,xftM]𝖳M\mathbf{x}_{ft}=[x_{ft1},\ldots,x_{ftM}]^{\mathsf{T}}\in\mathbb{C}^{M} is approximated using the frequency-wise mixing matrix 𝐀fM×N\mathbf{A}_{f}\in\mathbb{C}^{M\times N} as 𝐱ft=𝐀f𝐬ft\mathbf{x}_{ft}=\mathbf{A}_{f}\mathbf{s}_{ft}. In a determined situation (i.e., NMN\leq M), the nnth separated signal at the (f,t)(f,t)th bin is extracted using the nnth demixing vector 𝐰fnM\mathbf{w}_{\!fn}\in\mathbb{C}^{M} as follows:

yftn=𝐰fn𝖧𝐱ft.y_{ftn}=\mathbf{w}_{\!fn}^{\mathsf{H}}\mathbf{x}_{\!ft}. (1)

Then, the NN separated signals 𝐲ft=[yft1,,yftN]𝖳N\mathbf{y}_{\!ft}=[y_{ft1},\ldots,y_{ftN}]^{\mathsf{T}}\in\mathbb{C}^{N} are obtained by

𝐲ft=(yft1yftN)=(𝐰f1𝖧𝐱ft𝐰fN𝖧𝐱ft)=𝐖f𝐱ft,\mathbf{y}_{\!ft}=\begin{pmatrix}y_{ft1}\\ \vdots\vspace{-3pt}\\ y_{ftN}\\ \end{pmatrix}=\begin{pmatrix}\mathbf{w}_{\!f1}^{\mathsf{H}}\mathbf{x}_{\!ft}\\ \vdots\vspace{-3pt}\\ \mathbf{w}_{\!fN}^{\mathsf{H}}\mathbf{x}_{\!ft}\\ \end{pmatrix}=\mathbf{W}_{\!f}\mathbf{x}_{ft}, (2)

where

𝐖f=(𝐰f1𝖧𝐰fN𝖧)N×M\mathbf{W}_{\!f}=\begin{pmatrix}\mathbf{w}_{\!f1}^{\mathsf{H}}\\ \vdots\vspace{-3pt}\\ \mathbf{w}_{\!fN}^{\mathsf{H}}\end{pmatrix}\in\mathbb{C}^{N\times M} (3)

is the demixing matrix. For notational simplicity, we omit the indices to represent all the components altogether as 𝐱=((𝐱ft)f=1F)t=1T\mathbf{x}=((\mathbf{x}_{ft})_{f=1}^{F})_{t=1}^{T}, 𝐲=((𝐲ft)f=1F)t=1T\mathbf{y}=((\mathbf{y}_{\!ft})_{f=1}^{F})_{t=1}^{T} and 𝐖=(𝐖f)f=1F\mathbf{W}=(\mathbf{W}_{\!f})_{f=1}^{F}. Then, the linear operation in Eq. (2) for all frequencies f=1,,Ff=1,\ldots,F and times t=1,,Tt=1,\ldots,T is shortly represented as 𝐲=𝐖𝐱\mathbf{y}=\mathbf{W}\mathbf{x} for brevity.

To separate the sources using Eq. (2), addressing the permutation problem is essential. Namely, the extraction target of demixing vector 𝐰fn𝖧\mathbf{w}_{\!fn}^{\mathsf{H}} in Eq. (1) must be consistent across all frequencies f=1,,Ff=1,\ldots,F. A standard approach to the permutation problem is to model the co-occurrence of the frequency components. For instance, IVA [10, 11] considers the frequency-directional group structure, which results in the minimization problem of the following objective function:

(𝐖)=n=1Nt=1Tf=1F|yftn|22Tf=1Flog(|det(𝐖f)|),\mathcal{L}(\mathbf{W})=\sum_{n=1}^{N}\sum_{t=1}^{T}\sqrt{\sum_{f=1}^{F}|y_{ftn}|^{2}}-2T\sum_{f=1}^{F}\log(|\det(\mathbf{W}_{\!f})|), (4)

where yftn=𝐰fn𝖧𝐱fty_{ftn}=\mathbf{w}_{\!fn}^{\mathsf{H}}\mathbf{x}_{\!ft}. ILRMA[23] utilizes nonnegative matrix factorization (NMF) [42] to model low-rankness of the power spectrograms of sources. The typical objective function is

(𝐖,𝚯)=\displaystyle\mathcal{L}(\mathbf{W},\boldsymbol{\Theta})= n=1Nt=1Tf=1F(|yftn|2𝐭fn𝖳𝐯tn+log(𝐭fn𝖳𝐯tn))\displaystyle\sum_{n=1}^{N}\sum_{t=1}^{T}\sum_{f=1}^{F}\left(\frac{|y_{ftn}|^{2}}{\mathbf{t}_{fn}^{\mathsf{T}}\!\mathbf{v}_{tn}}+\log(\mathbf{t}_{fn}^{\mathsf{T}}\!\mathbf{v}_{tn})\right)
2Tf=1Flog(|det(𝐖f)|),\displaystyle\hskip 50.0pt-2T\sum_{f=1}^{F}\log(|\det(\mathbf{W}_{\!f})|), (5)

where 𝚯=((𝐓n)n=1N,(𝐕n)n=1N)\boldsymbol{\Theta}=((\mathbf{T}_{n})_{n=1}^{N},(\mathbf{V}_{n})_{n=1}^{N}) represents the set of auxiliary variables for NMF, 𝐓n=[𝐭1n,,𝐭Fn]𝖳+F×K\mathbf{T}_{n}=[\mathbf{t}_{1n},\ldots,\mathbf{t}_{Fn}]^{\mathsf{T}}\in\mathbb{R}_{+}^{F\times K} and 𝐕n=[𝐯1n,,𝐯Tn]+K×T\mathbf{V}_{n}=[\mathbf{v}_{1n},\ldots,\mathbf{v}_{Tn}]\in\mathbb{R}_{+}^{K\times T} are the basis and activation matrices, respectively, and KK is the number of bases.

2.2 Block Permutation Problem

Despite the great success of the above methods, they occasionally fail in separation due to the block permutation problem[16, 28, 29, 8, 36]. Figure 2 illustrates such a failure by an example of signals separated by IVA, where M=N=2M=N=2 and each of the two source signals is colored either green or red. As in Fig. 2 (b), three frequency blocks appeared in this case: the lower- and higher-frequency blocks extracting the green source and the middle-frequency block for the red source. Consequently, IVA did not keep consistent permutations among these frequency blocks, resulting in poor separation performance.

However, even when the block permutation problem arises, the demixing matrices obtained for each frequency block are often optimized correctly, except for their permutations. As an example, Fig. 3 displays the frequency-wise separation performance of the result shown in Fig. 2 (b). We calculated the frequency-wise version of the scale-invariant signal-to-distortion ratio (SI-SDR) of the separated signals 𝐲\mathbf{y} in the time-frequency domain as follows:

SI-SDRf\displaystyle\textrm{SI-SDR}_{f}
=10log10(n=1Nt=1T|s~ftn|2n=1Nt=1T|s~ftnyftn|2),\displaystyle\quad{}=10\log_{10}\left(\frac{\sum_{n=1}^{N}\sum_{t=1}^{T}|\tilde{s}_{ftn}|^{2}}{\sum_{n=1}^{N}\sum_{t=1}^{T}|\tilde{s}_{ftn}-y_{ftn}|^{2}}\right)\!, (6)

where s~ftn\tilde{s}_{ftn}\in\mathbb{C} is the scaled version of the nnth true source signal at the (f,t)(f,t)th bin, i.e.,

s~ftn\displaystyle\tilde{s}_{ftn} =(t=1Ty¯ftnsftnt=1T|sftn|2)sftn,\displaystyle=\left(\frac{\sum_{t=1}^{T}\bar{y}_{ftn}s_{ftn}}{\sum_{t=1}^{T}|s_{ftn}|^{2}}\right)\cdot s_{ftn}, (7)

and complex conjugation is denoted by ()¯\bar{(\cdot)}. The area filled with light gray in Fig. 3 corresponds to the separation result of IVA in Fig. 2 (b). The dotted line shows IVA + IPS, where IPS ideally resolved the permutation problem of the result of IVA so that the correlation between the separated signal and the true source is maximized (see Eq. (26)). For reference, the blue line indicates FDICA with IPS (FDICA + IPS). In this example, the separation performance of IVA was extremely bad within 0.5 to 1.5 kHz, which corresponds to the middle-frequency block in Fig. 2 (b), where the red source is extracted. However, IVA + IPS was comparable to that of FDICA + IPS across all frequencies, which highlights that if the block permutation was circumvented, then IVA performs effectively. This observation motivates us to propose some specialized techniques to avoid the block permutation problem for IVA or other existing BSS algorithms.

While the block permutation problem arises from various factors, one major factor is considered to be the difficulty in handling the complicated structure of audio signals. In speech signals, for example, there are two main components: vowels dominating the lower-frequency bands and consonants appearing in the higher-frequency bands. When a BSS algorithm attempts to separate the sources, it must align the permutations of these two components, which is not straightforward since they appear in different frequency bands and at different times. At the same time, locally looking at a narrower frequency band (e.g., the mid-frequency block in Fig. 2 (b)), only a few components are dominant and have a simple structure. Therefore, it should be easy for existing BSS algorithms to resolve permutations inside such narrower bands. If we split the BSS problem into a set of several BSS problems in the narrower frequency bands, then we can focus on how to align the permutations between the multiple bands. The block permutation solvers can be used for this approach with some additional computational costs, but we can resolve the block permutation problem without an additional cost, as described in the next section.

Refer to caption
Figure 3: Frequency-wise SI-SDR of the separation result in Fig. 2 (b), where the frequency bins are decimated by a factor of 16 for visibility. The light gray area represents the performance of IVA, where the block permutation problem arises in frequency bands from 0.5 to 1.5 kHz. The dotted line refers to IVA after the permutations are corrected using IPS. For reference, the blue line shows the result of FDICA + IPS.

3 Proposed Method

To solve the block permutation problem without paying additional costs, we propose a simple technique named subband splitting. The proposed method splits all the frequencies into several subbands and sequentially separates them using existing BSS methods. Narrowing the frequency band makes it easier to resolve permutations within each subband. Then, the permutations between the subbands are aligned by using the separation result in one subband as the initial values for the subsequent subbands. As in Fig. 2 (c), our method enhances the separation performance of IVA without modifying the algorithm. Here, we introduce the proposed technique and describe the way to generate the subbands used in our technique. Subsequently, we note the number of iterations and the computational cost of the proposed technique.

3.1 Subbands and BSS Method

In the proposed method, the entire frequencies are split into II subbands (i)i=1I(\mathcal{F}_{i})_{i=1}^{I}, where the iith subband i\mathcal{F}_{i} is the set of frequency indices,

i={f{1,,F}LifHi},\mathcal{F}_{i}=\left\{f\in\{1,\ldots,F\}\mid L_{i}\leq f\leq H_{i}\right\}, (8)

FF is the number of frequency bins, and LiL_{i} and HiH_{i} (LiHi)(L_{i}\leq H_{i}) are the lower and upper bounds for the iith subband i\mathcal{F}_{i}, respectively. All frequency indices must be contained in at least one subband, and hence (i)i=1I(\mathcal{F}_{i})_{i=1}^{I} must satisfy

i=1Ii={1,,F}.\bigcup_{i=1}^{I}\mathcal{F}_{i}=\{1,\ldots,F\}. (9)

Additionally, the adjacent subbands must overlap, i.e.,

ii+1(1iI1).\mathcal{F}_{i}\cap\mathcal{F}_{i+1}\neq\emptyset\qquad(1\leq i\leq I-1). (10)

For notational convenience, let the separation procedure of a BSS algorithm (e.g., IVA and ILRMA) be written as

(y,𝐖,𝚯)𝖡𝖲𝖲(𝐱,𝐖(init),𝚯(init)),(\textbf{y},\mathbf{W},\boldsymbol{\Theta})\leftarrow\mathsf{BSS}(\mathbf{x},\mathbf{W}^{(\text{init})},\boldsymbol{\Theta}^{(\text{init})}), (11)

where the BSS algorithm receives an observed signal 𝐱\mathbf{x}, an initial value of the demixing matrix 𝐖(init)\mathbf{W}^{(\text{init})}, and initial values of other auxiliary variables 𝚯(init)\boldsymbol{\Theta}^{(\text{init})} (e.g., NMF variables (𝐓n)n=1N(\mathbf{T}_{n})_{n=1}^{N} and (𝐕n)n=1N(\mathbf{V}_{n})_{n=1}^{N} of ILRMA in Eq. (5) or any other variables necessary for the BSS algorithm). After running the BSS algorithm, it returns separated signals 𝐲\mathbf{y}, the corresponding demixing matrix 𝐖\mathbf{W}, and the corresponding auxiliary variables 𝚯\boldsymbol{\Theta}.

3.2 Proposed Method: Subband Splitting

Using the above notation, the proposed subband splitting for any BSS algorithm 𝖡𝖲𝖲()\mathsf{BSS}(\cdot) is given as follows:

(𝐱i,𝐖i,𝚯i)𝖾𝗑𝗍𝗋𝖺𝖼𝗍i(𝐱,𝐖,𝚯),(𝐲i,𝐖i,𝚯i)𝖡𝖲𝖲(𝐱i,𝐖i,𝚯i),(𝐲,𝐖,𝚯)𝗌𝗎𝖻𝗌𝗍𝗂𝗍𝗎𝗍𝖾i(𝐲i,𝐖i,𝚯i),\left\lfloor\begin{aligned} \;(\mathbf{x}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}})&\leftarrow\mathsf{extract}_{\mathcal{F}_{i}}(\mathbf{x},\mathbf{W},\boldsymbol{\Theta}),\\ (\mathbf{y}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}})&\leftarrow\mathsf{BSS}(\mathbf{x}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}}),\\ (\mathbf{y},\mathbf{W},\boldsymbol{\Theta})&\leftarrow\mathsf{substitute}_{\mathcal{F}_{i}}(\mathbf{y}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}}),\vspace{-10pt}\end{aligned}\right. (12)

where 𝖾𝗑𝗍𝗋𝖺𝖼𝗍i()\mathsf{extract}_{\mathcal{F}_{i}}(\cdot) is the operator that extracts the part of variables corresponding to the iith subband i\mathcal{F}_{i}. The extracted part is indicated by the subscript ()i(\cdot)_{\mathcal{F}_{i}} as

𝐱i=((𝐱ft)fi)t=1T,𝐖i=(𝐖f)fi.\displaystyle\mathbf{x}_{\mathcal{F}_{i}}=((\mathbf{x}_{ft})_{f\in{\mathcal{F}_{i}}})_{t=1}^{T},\qquad\mathbf{W}_{\!\mathcal{F}_{i}}=(\mathbf{W}_{\!f})_{f\in{\mathcal{F}_{i}}}. (13)

The operator 𝗌𝗎𝖻𝗌𝗍𝗂𝗍𝗎𝗍𝖾i()\mathsf{substitute}_{\mathcal{F}_{i}}(\cdot) substitutes the extracted part of the variables back into their original location as

((𝐲ft)fi)t=1T𝐲i,(𝐖f)fi𝐖i,\displaystyle((\mathbf{y}_{\!ft})_{f\in{\mathcal{F}_{i}}})_{t=1}^{T}\leftarrow\mathbf{y}_{\mathcal{F}_{i}},\qquad(\mathbf{W}_{\!f})_{f\in{\mathcal{F}_{i}}}\leftarrow\mathbf{W}_{\!\mathcal{F}_{i}}, (14)

namely, the part of optimization variables associated with the subband i\mathcal{F}_{i} are replaced with the latest optimization result provided by 𝖡𝖲𝖲()\mathsf{BSS}(\cdot). The proposed method repeats Eq. (12) for i=1,,Ii=1,\ldots,I so that 𝖡𝖲𝖲()\mathsf{BSS}(\cdot) is applied to all the subbands.

Note that 𝖾𝗑𝗍𝗋𝖺𝖼𝗍i()\mathsf{extract}_{\mathcal{F}_{i}}(\cdot) and 𝗌𝗎𝖻𝗍𝗂𝗍𝗎𝗍𝖾i()\mathsf{subtitute}_{\mathcal{F}_{i}}(\cdot) for the auxiliary variables 𝚯\boldsymbol{\Theta} should be defined similarly, depending on the BSS algorithms used in each subband. Specific definitions of the extraction/substitution operators for SS-IVA and SS-ILRMA are described in Section 4.1

The most important aspect of the proposed method is the existence of the overlap of the subbands. In the overlapping part ii+1\mathcal{F}_{i}\cap\mathcal{F}_{i+1}, the output of 𝖡𝖲𝖲()\mathsf{BSS}(\cdot) within the iith subband is carried over to the (i+1)(i+1)th subband as the initial value for 𝖡𝖲𝖲()\mathsf{BSS}(\cdot). Therefore, within the (i+1)(i+1)th subband, the BSS algorithm is induced to obtain the same permutation as that obtained in the iith subband. This initialization strategy can keep consistent permutation among the subbands if there are sufficient overlaps.

3.3 Subband Generation by Shift Rules

To generate the subbands, we define shift rules that constantly shift the subbands upward or downward:

(Li+1,Hi+1)={(Li+Δ,Hi+Δ)(Up)(LiΔ,HiΔ)(Down)(L_{i+1},H_{i+1})=\begin{cases}(L_{i}+\Delta,H_{i}+\Delta)&(\textrm{Up})\\ (L_{i}-\Delta,H_{i}-\Delta)&(\textrm{Down})\end{cases} (15)

where Δ>0\Delta>0 is the amount of shift.

To easily compare the settings for subbands, we introduce subband parameter (θW,θΔ)(\theta_{W},\theta_{\Delta}). The parameter θW1\theta_{W}\geq 1 determines the width of subband WW (=HiLi+1)(=H_{i}-L_{i}+1) and the parameter θΔ1\theta_{\Delta}\geq 1 controls the amount of shift Δ\Delta as follows:

W=F/θW,Δ=W/θΔF/(θWθΔ),\displaystyle W=\lceil F/\theta_{W}\rceil,\qquad\Delta=\lceil W/\theta_{\Delta}\rceil\approx F/(\theta_{W}\theta_{\Delta}), (16)

where \lceil\cdot\rceil is the ceiling function. That is, the width of each subband WW is 1/θW1/\theta_{W} times the entire number of frequency bins FF. The subband is shifted by the amount of another 1/θΔ1/\theta_{\Delta} times the bandwidth WW.

To ensure that all the frequencies, including the highest and the lowest, are updated by the same number, we set the first bounds as follows:

(L1,H1)={(ΔW+1,Δ)(Up)(FΔ+1,FΔ+W)(Down).(L_{1},H_{1})=\begin{cases}(\Delta-W+1,\Delta)&(\textrm{Up})\\ (F-\Delta+1,F-\Delta+W)&(\textrm{Down}).\end{cases} (17)

Using Eq. (17), all the frequency bins are included in θΔ\theta_{\Delta} subbands. Namely, for every frequency (including the lowest and the highest), BSS methods are applied by the same number, i.e., θΔ\theta_{\Delta} times222 This may not be true when θΔ\theta_{\Delta} is neither an integer nor a divisor of WW. For example, when θΔ=2.5\theta_{\Delta}=2.5, BSS methods are applied either 2 (=θΔ)({}=\lfloor\theta_{\Delta}\rfloor) or 3 (=θΔ)({}=\lceil\theta_{\Delta}\rceil) times for each frequency bin, where \lfloor\cdot\rfloor denotes the flooring function. . Note that the first and the last bounds overflow the entire frequency band {1,,F}\{1,\ldots,F\}, i.e., Li<1L_{i}<1 or Hi>FH_{i}>F holds for smaller and larger ii. Such bounds are clipped to the range of 11 to FF when each subband i\mathcal{F}_{i} is calculated (see Eq. (8)), and then the corresponding subbands become narrower than WW. For instance, the first subband 1\mathcal{F}_{1} and (when θWθΔ\theta_{W}\theta_{\Delta} is a divisor of FF) the last subband I\mathcal{F}_{I} have only Δ\Delta frequency bins.

Figure 4 illustrates an example of the subbands generated by the above rule. Here, the subband parameter is set to (θW,θΔ)=(3,2)(\theta_{W},\theta_{\Delta})=(3,2), i.e., the bandwidth is W=F/3W=F/3, and the amount of shift is Δ=F/6\Delta=F/6. All the frequency bins are included in 2 (=θΔ)({}=\theta_{\Delta}) subbands. The subband splitting with the shift rule is summarized in Alg. 1, where any BSS method can be used for 𝖡𝖲𝖲()\mathsf{BSS}(\cdot).

Refer to caption
Figure 4: Subbands generated by Eqs. (15), (16), and (17). The horizontal axis indicates the frequency index f=1,,Ff=1,\ldots,F. The bounds (Li,Hi)(L_{i},H_{i}) are indicated with the rounded edges of the bars, and its corresponding subband i\mathcal{F}_{i} is shown by their filled area. The subband parameter was set to (θW,θΔ)=(3,2)(\theta_{W},\theta_{\Delta})=(3,2). Note that all the frequency bins are included in 22 (=θΔ)({}=\theta_{\Delta}) subbands.
Input: 𝐱\mathbf{x}, 𝖡𝖲𝖲()\mathsf{BSS}(\cdot), θW\theta_{W}, θΔ\theta_{\Delta}
Initialize 𝐖,𝚯\mathbf{W},\boldsymbol{\Theta}.
Set Δ\Delta using Eq. (16), and set (L1,H1)(L_{1},H_{1}) using Eq. (17).
1{f{1,,F}L1fH1}\mathcal{F}_{1}\leftarrow\left\{f\in\{1,\ldots,F\}\mid L_{1}\leq f\leq H_{1}\right\}
i1i\leftarrow 1
while i\mathcal{F}_{i}\neq\emptyset do
       (𝐱i,𝐖i,𝚯i)𝖾𝗑𝗍𝗋𝖺𝖼𝗍i(𝐱,𝐖,𝚯)(\mathbf{x}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}})\leftarrow\mathsf{extract}_{\mathcal{F}_{i}}(\mathbf{x},\mathbf{W},\boldsymbol{\Theta})
       (𝐲i,𝐖i,𝚯i)𝖡𝖲𝖲(𝐱i,𝐖i,𝚯i)(\mathbf{y}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}})\leftarrow\mathsf{BSS}(\mathbf{x}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}})
       (𝐲,𝐖,𝚯)𝗌𝗎𝖻𝗌𝗍𝗂𝗍𝗎𝗍𝖾i(𝐲i,𝐖i,𝚯i)(\mathbf{y},\mathbf{W},\boldsymbol{\Theta})\leftarrow\mathsf{substitute}_{\mathcal{F}_{i}}(\mathbf{y}_{\mathcal{F}_{i}},\mathbf{W}_{\!\mathcal{F}_{i}},\boldsymbol{\Theta}_{\mathcal{F}_{i}})
       Obtain next bound (Li+1,Hi+1)(L_{i+1},H_{i+1}) by Eq. (15)
       i+1{f{1,,F}Li+1fHi+1}\mathcal{F}_{i+1}\leftarrow\left\{f\in\{1,\ldots,F\}\mid L_{i+1}\leq f\leq H_{i+1}\right\}
       ii+1i\leftarrow i+1
Output: 𝐲\mathbf{y}
Algorithm 1 Subband splitting with shift rules

3.4 Number of Inner Iteration and Computational Cost

Two iterations appear in the proposed method: (i) the outer iteration i=1,,Ii=1,\ldots,I, where 𝖡𝖲𝖲()\mathsf{BSS}(\cdot) is sequentially applied to the subbands (i)i=1I(\mathcal{F}_{i})_{i=1}^{I}, and (ii) the inner iteration that corresponds to the iterative updates in 𝖡𝖲𝖲()\mathsf{BSS}(\cdot).

Here, we confirm the relationship between the parameter θΔ\theta_{\Delta}, the number of inner iterations JinnerJ_{\textrm{inner}}, and the total number of updates JtotalJ_{\textrm{total}}, i.e., how many times the demixing matrix 𝐖f\mathbf{W}_{f} is updated for each frequency 333 The total number of updates of the demixing matrix for each frequency will be the same because every frequency bin is included in the same number of subbands, as described in Section 3.3. . Since the proposed method runs BSS algorithms θΔ\theta_{\Delta} times duplicated for each frequency, the total number of updates increases as θΔ\theta_{\Delta} increases, i.e.,

Jtotal=θΔJinner.J_{\textrm{total}}=\theta_{\Delta}J_{\textrm{inner}}. (18)

For example, in Fig. 4, all the frequency bins are updated twice because θΔ=2\theta_{\Delta}=2.

With fixing the number of inner iterations JinnerJ_{\mathrm{inner}}, the overall computational cost becomes θΔ\theta_{\Delta} times larger due to the duplicated run of the BSS algorithm. This increase in computational cost can be canceled by dividing the number of inner iterations JinnerJ_{\mathrm{inner}} by θΔ\theta_{\Delta}. Namely, given a total number of updates JtotalJ_{\textrm{total}}, set the number of inner iterations as follows444 The increase in computational cost due to the duplicated runs of BSS algorithms is canceled using Eq. (19) when the cost of BSS algorithms depends linearly on the number of frequency bins. This condition is satisfied with AuxIVA[12] and ILRMA[23]. :

Jinner=Jtotal/θΔ.J_{\textrm{inner}}=\lceil J_{\textrm{total}}/\theta_{\Delta}\rceil. (19)

Note that the proposed method can improve memory efficiency because each subband is narrower than usual.

4 Experiment

In this section, we propose SS-IVA and SS-ILRMA by combining our subband splitting with IVA and ILRMA, respectively. We also describe a conventional subband-aware method, which we call overlapped-clique-based IVA (OC-IVA) here. We then evaluate their separation performance and computational costs using speech signals. We also evaluate the separation performance of SS-ILRMA using music signals.

4.1 Proposed Method: SS-IVA and SS-ILRMA

4.1.1 SS-IVA

The proposed SS-IVA uses AuxIVA[12] for the function 𝖡𝖲𝖲()\mathsf{BSS}(\cdot) in Alg. 1. Since 𝐖\mathbf{W} is the only variable optimized in AuxIVA, the auxiliary variable 𝚯\boldsymbol{\Theta} was omitted.

4.1.2 SS-ILRMA

The proposed SS-ILRMA uses ILRMA[23] for 𝖡𝖲𝖲()\mathsf{BSS}(\cdot) in Alg. 1. Note that ILRMA has auxiliary variables 𝚯=((𝐓n)n=1N,(𝐕n)n=1N)\boldsymbol{\Theta}=((\mathbf{T}_{n})_{n=1}^{N},(\mathbf{V}_{n})_{n=1}^{N}) for NMF, i.e., the basis matrices (𝐓n)n=1N(\mathbf{T}_{n})_{n=1}^{N} and the activation matrices (𝐕n)n=1N(\mathbf{V}_{n})_{n=1}^{N}. Therefore, the operation of 𝖾𝗑𝗍𝗋𝖺𝖼𝗍i()\mathsf{extract}_{\mathcal{F}_{i}}(\cdot) and 𝗌𝗎𝖻𝗌𝗍𝗂𝗍𝗎𝗍𝖾i()\mathsf{substitute}_{\mathcal{F}_{i}}(\cdot) for 𝚯\boldsymbol{\Theta} must be defined, as well as 𝐱\mathbf{x}, 𝐖\mathbf{W} and 𝐲\mathbf{y} in Eqs. (13) and (14). Here, we define the extraction operation for 𝚯\boldsymbol{\Theta} as follows:

𝚯i=(𝐓i,(𝐕n)n=1N),\displaystyle\boldsymbol{\Theta}_{\mathcal{F}_{i}}=(\mathbf{T}_{\mathcal{F}_{i}},(\mathbf{V}_{n})_{n=1}^{N}), (20)

where

𝐓i=((𝐭fn𝖳)fi)n=1N,\mathbf{T}_{\mathcal{F}_{i}}=((\mathbf{t}_{fn}^{\mathsf{T}})_{f\in\mathcal{F}_{i}})_{n=1}^{N}, (21)

and substitution operation to do the reverse. Namely, it extracts the part of basis matrices (𝐓n)n=1N(\mathbf{T}_{n})_{n=1}^{N} associated with the subband i\mathcal{F}_{i}, similar to demixing matrices 𝐖\mathbf{W}. At the same time, it carries over the activation matrices (𝐕n)n=1N(\mathbf{V}_{n})_{n=1}^{N} optimized in the iith subband to the (i+1)(i+1)th subband as initial values. This definition is based on the following intuition that the activation (𝐕n)n=1N(\mathbf{V}_{n})_{n=1}^{N} roughly reflects whether the nnth source is active at each time, and they become similar in two adjacent subbands. Carrying over a common activation matrices (𝐕n)n=1N(\mathbf{V}_{n})_{n=1}^{N} provides a stronger bond between the subbands and helps avoiding block permutation problems.

Refer to caption
Figure 5: SDRi and permutation consistency of each method for speech signals. The proposed methods are emphasized by (light) green and bold letters. The number of basis for ILRMA KK is indicated by 2\langle 2\rangle and 10\langle 10\rangle. Boxes for IVA-based methods contain 224 (== 56 pairs of speech sources ×\times 4 pairs of source directions) results, while those for ILRMA-based methods have 1120 (=224×5{}=224\times 5 seeds) results. Large markers show the worst cases. For the proposed SS-IVA and SS-ILRMA, the direction of the shift is indicated by \Uparrow (upward) and \Downarrow (downward).

4.2 Comparison Method: OC-IVA[18]

To compare our subband splitting with conventional subband-aware methods, we tested OC-IVA, which is an extension of IVA[18]. Similar to the proposed method, OC-IVA considers overlapping subbands to estimate demixing matrices. The significant difference with our SS-IVA is that OC-IVA separates all frequency bands simultaneously using an algorithm derived from the following objective function:

(𝐖)=n=1Nt=1Ti=1Ifi|yftn|2\displaystyle\mathcal{L}(\mathbf{W})=\sum_{n=1}^{N}\sum_{t=1}^{T}\sum_{i=1}^{I}\sqrt{\sum_{f\in\mathcal{F}_{i}}|y_{ftn}|^{2}}
2Tf=1Flog(|det(𝐖f)|),\displaystyle\quad\quad\quad\quad\quad\quad\quad\quad\quad-2T\sum_{f=1}^{F}\log(|\det(\mathbf{W}_{f})|), (22)

where we generate the subbands (i)i=1I(\mathcal{F}_{i})_{i=1}^{I} in the same way with our method. Note that the first term models the subband-wise group structure (see Eq. (4) for comparison to the original IVA). Comparison with OC-IVA helps to see whether optimizing the subbands simultaneously or sequentially yields better performance.

4.3 Experimental Settings

We evaluated BSS methods using two-channel mixtures of two sources, i.e., N=M=2N=M=2. The speech and music signals were obtained from SiSEC 2011[40] dataset. For speech signals (Section 4.4 and 4.5), 8 speech signals were obtained from dev1 in underdetermined speech and music mixtures task[40]. Then, all 56 possible pairs (considering their permutations) were used as source signals. Their duration was 10 s, and the sampling frequency was 16 kHz. For music signals (Section 4.6), we used 4 songs from professionally produced music recordings task[40]. For each song, two instruments (including vocals, guitars, violins, and synth) were chosen as in [23, Table 3] (the paper proposed ILRMA). They were downsampled to 16 kHz.

The observed signals were generated by convolving room impulse responses in [41] with downsampling to 16 kHz. The reverberation time was 160 ms. The pair of source directions were (45,30)(-45^{\circ},30^{\circ}), (75,30)(-75^{\circ},30^{\circ}), (45,60)(-45^{\circ},60^{\circ}) and (75,60)(-75^{\circ},60^{\circ}). The spacing between the two microphones was 8 cm, and the distance between the sources and the center of the microphones was 1 m.

The demixing matrices 𝐖f\mathbf{W}_{\!f} were initialized with identity matrices. All elements of the initial value for (𝐓n)n=1N(\mathbf{T}_{n})_{n=1}^{N} and (𝐕n)n=1N(\mathbf{V}_{n})_{n=1}^{N} were drawn from the uniform distribution 𝒰(0,1)\mathcal{U}(0,1), where five different seeds were used.

The window length and the hop size of STFT were set to 2048 and 1024 samples, respectively. The Hann window was used for the window function. The Nyquist frequency was omitted from the optimization target, i.e., the number of frequency bins FF was 10241024.

The source-to-distortion ratio improvement (SDRi) [43] was used to evaluate separation performance. In addition, we define permutation consistency, a weighted accuracy of permutation, as follows:

Permutation Consistency=\displaystyle\text{Permutation Consistency}={} (23)
max1pN!(f=1Fγfδp,qff=1Fγf)×100,\displaystyle\quad\quad\quad\quad\quad\quad\max_{1\leq p\leq N!}\left(\frac{\textstyle\sum_{f=1}^{F}\gamma_{f}\cdot\delta_{p,q_{f}}}{\textstyle\sum_{f=1}^{F}\gamma_{f}}\right)\times 100,

where γf\gamma_{f} is the frequency-wise weight given by the power of each frequency, i.e.,

γf=1NTn=1Nt=1T|sftn|2,\gamma_{f}=\frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}|s_{ftn}|^{2}, (24)

δij\delta_{ij} is Kronecker’s delta,

δij={1(i=j)0(ij).\delta_{ij}=\begin{cases}1&(i=j)\\ 0&(i\neq j).\end{cases} (25)

qfq_{f} is the index of correct permutation that achieves the highest correlation between the source and separated signal at each frequency, i.e.,

qf\displaystyle q_{f} =argmax1qN!n=1Nt=1T|sftnyft(σqn)|,\displaystyle=\underset{1\leq q^{\prime}\leq N!}{\mathrm{arg~{}max}}~{}\sum_{n=1}^{N}\sum_{t=1}^{T}|s_{ftn}\,y_{ft(\sigma_{q^{\prime}n})}|, (26)

where σqn{1,,N}\sigma_{qn}\in\{1,\ldots,N\}, 𝝈q=(σq1,,σqN)SN\boldsymbol{\sigma}_{q}=(\sigma_{q1},\ldots,\sigma_{qN})\in S_{N} is the qqth permutation (i.e., {𝝈1,,𝝈N!}=SN\{\boldsymbol{\sigma}_{1},\ldots,\boldsymbol{\sigma}_{N!}\}=S_{N}), and SNS_{N} is the set of all permutations of NN elements. Permutation consistency takes the values between 1/(N!)1/(N!) and 11, and the higher, the better. The experiments were performed with MATLAB R2024b on AMD Ryzen 9 5950X (3.40 GHz, 16 core).

Table 1: Average SDRi and permutation consistency for speech mixture. Each column corresponds to a subband parameter (θW,θΔ)(\theta_{W},\theta_{\Delta}), and the best value is bolded. For the proposed SS-IVA and SS-ILRMA, the directions are indicated by \Uparrow (upward) and \Downarrow (downward).
Method (\star: ours) KK SDRi [dB] Permutation consistency [%]
(2,2)(2,2) (2,4)(2,4) (4,2)(4,2) (4,4)(4,4) (2,2)(2,2) (2,4)(2,4) (4,2)(4,2) (4,4)(4,4)
IVA Vanilla [12] 6.11 85.42
OC [18] 10.05 9.13 6.94 9.76 96.16 92.97 88.3 94.69
SS \Uparrow \star 10.09 10.34 10.82 10.90 96.25 96.81 97.55 97.77
SS \Downarrow \star 10.73 11.00 10.96 10.53 97.29 97.65 97.60 95.98
ILRMA Vanilla [23] 2 8.90 91.71
Vanilla [23] 10 8.56 89.91
SS \Uparrow \star 2 10.01 11.29 11.50 10.82 94.79 97.75 98.09 96.46
SS \Uparrow \star 10 8.91 11.06 11.03 9.84 92.02 96.90 96.93 93.97
SS \Downarrow \star 2 11.25 11.67 12.06 11.95 97.91 98.78 99.60 99.49
SS \Downarrow \star 10 10.57 10.97 11.71 11.79 96.15 97.21 98.86 99.12
FDICA + IPS 12.20 100.00

4.4 Separation Performance for Speech Signals

Refer to caption
Figure 6: Example of a speech signal. The signal is obtained from dev1_male4_src_2.wav in [40]. The vertical axis ranges from 0 to 8 kHz, and the highest subband (A) and lowest subband (B) are highlighted. The subband parameters are set to (θW,θΔ)=(2,2)(\theta_{W},\theta_{\Delta})=(2,2). The middle-frequency band is darkened to emphasize the highest and lowest subbands.

We evaluated the separation performance of BSS methods using speech signals. The total number of updates JtotalJ_{\textrm{total}} was fixed to 100. The parameters θW\theta_{W} and θΔ\theta_{\Delta} were set to 2 or 4 to ensure that θΔ{2,4}\theta_{\Delta}\in\{2,4\} is a divisor of 100 (= JtotalJ_{\textrm{total}}) and that θWθΔ{2,4,8}\theta_{W}\theta_{\Delta}\in\{2,4,8\} is a divisor of 1024 (= FF). The number of bases for ILRMA KK was set to 2 and 10. We also tested FDICA + IPS, the oracle method that individually separates each frequency via FDICA and solves the permutation via IPS, where IPS finds the ideal permutation using Eq. (26).

Figure 5 and Table 1 summarize their separation performance, where Vanilla refers to the original IVA or ILRMA handling all frequencies simultaneously. The experimental results show the notable improvements brought by the proposed subband splitting.

Both the conventional OC-IVA and our proposed SS-IVA outperformed Vanilla IVA, confirming that band splitting enhances separation performance. In all the settings, our SS-IVA (sequential separation) consistently outperformed OC-IVA (simultaneous separation). Notably, the permutation consistency of SS-IVA was significantly high compared to OC-IVA, which demonstrates that the sequential procedure contributes to aligning permutation.

For ILRMA, the proposed SS-ILRMA obtained the highest performance for almost all subband parameters. Downward SS-ILRMA with (K,θW,θΔ)=(2,4,2)(K,\theta_{W},\theta_{\Delta})=(2,4,2) achieved the best result in the experiment. Remarkably, its separation performance was comparable to FDICA + IPS, and the permutation consistency was almost perfect (99.60 on average). Moreover, even under the worst case (indicated by a large marker), it resulted in 6.226.22 dB in SDRi, which was 7.887.88 dB higher than Vanilla ILRMA (1.66-1.66 dB). Note that the average performance of ILRMA and SS-ILRMA tended to be better when K=2K=2 rather than when K=10K=10. At the same time, when (θW,θΔ)=(4,4)(\theta_{W},\theta_{\Delta})=(4,4), the downward SS-ILRMA with K=10K=10 achieved similar performance as K=2K=2 (see Fig. 5), indicating that splitting into smaller subbands reduces the size of the problem within each subband and helps the algorithms to work stably.

The downward shift tended to be superior to the upward shift, which was particularly noticeable in SS-ILRMA. That is, starting with the separation of higher frequency bands was more effective. This is likely due to the sparsity (an important cue for many BSS algorithms) of the speech signals in higher frequency bands. To illustrate this, Fig. 6 shows a speech signal used in the experiment along with the (A) highest and (B) lowest subbands, where the subband parameter was set to (θW,θΔ)=(2,2)(\theta_{W},\theta_{\Delta})=(2,2). When using the upward shift, the lowest band (B) must be separated first. However, this subband is relatively dense, making it difficult for sparsity-based BSS methods. Failures at this subband negatively affect the separation of subsequent subbands. On the other hand, when the downward shift is used, the highest subband (A) is separated first. Owing to its sparsity and simple structure (i.e., components concentrated in specific time frames and appearing as vertical patterns), this subband can be successfully separated using IVA and ILRMA. The results from the higher subbands are then propagated to the lower (i.e., more challenging) subbands and help their separation, leading to better outcomes.

4.5 Computational Costs

Refer to caption
Figure 7: RTF with various subband parameters. The proposed methods are emphasized by (light) green and bold letters. The downward shift was used for SS-IVA and SS-ILRMA. The number of basis KK for ILMRA is set to 22, as indicated by 2\langle 2\rangle.
Refer to caption

(a) Per number of updates.
Refer to caption (b) Per RTF.

Figure 8: Transition of average SDRi per (a) total number of updates JtotalJ_{\mathrm{total}} and (b) RTF for speech signals. Markers are shown for each 20 updates. The proposed methods are emphasized by (light) green and bold letters. The downward shift was used for SS-IVA and SS-ILRMA. The number of basis KK for ILMRA and SS-ILRMA was set to 22, as indicated by 2\langle 2\rangle. The best subband parameters (θW,θΔ)(\theta_{W},\theta_{\Delta}) in Table 1 was used for each method. The separation performance of FDICA+IPS is shown by the horizontal line.

Next, we investigate the computational costs of the proposed technique using the same mixtures as in Section 4.4 We run each BSS method 20 times to measure the real-time factor (RTF), i.e., the normalized runtime required to process one second of the signal. To check the implementations, we also tested SS-IVA and SS-ILRMA with (θW,θΔ)=(1,1)(\theta_{W},\theta_{\Delta})=(1,1), which agrees with the Vanilla version of IVA and ILRMA. The total number of updates JtotalJ_{\textrm{total}} was fixed to 100. The downward shift was used, and the number of bases KK was set to 22 since SS-IVA and SS-ILRMA were found to perform better with this setting (see Section 4.4).

The results are shown in Fig. 7. The conventional OC-IVA and the proposed SS-IVA took slightly longer runtime than Vanilla IVA due to band splitting. Their RTF increased as the subband parameters θW\theta_{W} and θΔ\theta_{\Delta} grew. For ILRMA, the subband splitting was found to reduce the computation time. It was also observed that the runtime in SS-ILRMA was primarily influenced by θW\theta_{W}, which determines the bandwidth WW, and was less dependent on θΔ\theta_{\Delta}, which controls the amount of shift Δ\Delta.

We also examined the relationship between computational time and separation performance. The total number of updates JtotalJ_{\textrm{total}} was varied from 0 to 100 in increments of 4. For each method, the best subband parameters (θW,θΔ)(\theta_{W},\theta_{\Delta}) in Table 1 were used: (2,2)(2,2) for OC-IVA, (2,4)(2,4) for downward SS-IVA, and (4,2)(4,2) for downward SS-ILRMA with K=2K=2.

Figure 8 (a) shows the average separation performance against the total number of updates JtotalJ_{\textrm{total}}. The separation performance of conventional IVA and ILRMA grew gradually, which is due to some mixtures that require many iterations for resolving the permutation problem. On the other hand, the proposed SS-IVA and SS-ILRMA rapidly converged in fewer iterations. SS-IVA reached the ceiling of the separation performance around Jtotal=12J_{\textrm{total}}=12 (=θΔ×Jinner=4×3)({}=\theta_{\Delta}\times J_{\textrm{inner}}=4\times 3), and SS-ILRMA around Jtotal=8J_{\textrm{total}}=8 (=θΔ×Jinner=2×4)({}=\theta_{\Delta}\times J_{\textrm{inner}}=2\times 4). The required number of updates JtotalJ_{\textrm{total}} (i.e., 12 and 8) is less than half compared to that in the previous literature, e.g., [23], thanks to the ability to avoid erroneous permutation.

Figure 8 (b) shows the average separation performance versus RTF. The proposed SS-IVA and SS-ILRMA were confirmed to converge with shorter runtimes than conventional methods.

Refer to caption
Figure 9: SDRi of each method for music signals. The proposed methods are emphasized by (light) green and bold letters. The number of basis for ILRMA KK is indicated by 2\langle 2\rangle and 10\langle 10\rangle. Each box contains 80 (== 4 pairs of music sources ×\times 4 pairs of source directions ×\times 5 seeds) results. Large markers show the worst cases. For the proposed SS-ILRMA, the direction of the shift is indicated by \Uparrow (upward) and \Downarrow (downward).

4.6 Separation Performance for Music Signals

From the experimental results so far, we have confirmed that, for speech signals, SS-ILRMA is highly effective. This might be a surprising outcome because ILRMA is known to be unsuitable for speech signals but suitable for music signals. Therefore, we additionally evaluated SS-ILRMA on music signals. The parameters are the same as in Section 4.4

Figure 9 and Table 2 show the separation performance of ILRMA, SS-ILRMA, and FDICA+IPS. When the number of basis KK was set to 10, the Vanilla ILRMA performed worse in some cases555 When only evaluating the song ID4 in [23, Table 3] (i.e., guitar and synth in [23, Fig. 12] that investigates the relationship between KK and SDRi), the average SDRi of Vanilla ILRMA was 16.24 dB (K=2K=2) and 16.62 dB (K=10K=10), i.e., K=10K=10 outperformed K=2K=2, which agrees with [23, Fig. 12]. The poor performance of Vanilla ILRMA with K=10K=10 in Fig. 9 is due to the other mixtures, especially ID2 and ID3. . The upward SS-ILRMA was ineffective for music signals, while the downward SS-ILRMA performed robustly for both K=2K=2 and K=10K=10. In particular, the downward SS-ILRMA with (K,θW,θΔ)=(10,2,2)(K,\theta_{W},\theta_{\Delta})=(10,2,2) yielded the best results in terms of the median SDRi, where the worst-case SDRi was improved by 7.51 dB compared to Vanilla ILRMA with K=10K=10. Note that downward SS-ILRMA with (K,θW,θΔ)=(2,2,2)(K,\theta_{W},\theta_{\Delta})=(2,2,2), (2,4,2)(2,4,2), and (10,2,2)(10,2,2) even outperformed FDICA+IPS on average, which might be because SS-ILRMA can more precisely model the structure of the power spectrogram (including the relationship among the time-frequency bins) by using NMF, while FDICA handles the frequency bins independently.

Table 2: Average SDRi of ILRMA (Vanilla), SS-ILRMA (SS) and FDICA+IPS for music signals. Each column corresponds to a subband parameter (θW,θΔ)(\theta_{W},\theta_{\Delta}), and the best value is bolded. For the proposed SS-ILRMA, the directions are indicated by \Uparrow (upward) and \Downarrow (downward).
Method KK SDRi [dB]
(\star: ours) (2,2)(2,2) (2,4)(2,4) (4,2)(4,2) (4,4)(4,4)
Vanilla [23] 2 11.65
Vanilla [23] 10 9.99
SS \Uparrow \star 2 9.75 7.02 6.96 6.97
SS \Uparrow \star 10 9.75 4.67 4.66 5.75
SS \Downarrow \star 2 11.89 11.48 12.32 11.59
SS \Downarrow \star 10 12.29 11.53 11.61 10.85
FDICA+IPS 11.71
Refer to caption
Figure 10: Example of a music signal (guitar). The signal is obtained from dev1__bearlin-roads__snip_85_99 __acoustic_guit_main.wav in [40]. The vertical axis ranges from 0 to 8 kHz, and the highest subband (A) and lowest subband (B) are highlighted. The subband parameters are set to (θW,θΔ)=(2,2)(\theta_{W},\theta_{\Delta})=(2,2). The middle-frequency band is darkened to emphasize the highest and lowest subbands.

The difference between upward and downward shifts was more pronounced in music signals compared to that in speech signals (Section 4.4). This should be because the separation of lower frequency bands in music signals can be significantly challenging for some instruments. As an example, Fig. 10 shows the signal of the guitar used in the experiment. Similar to speech signals, the highest subband (A) is sparse. On the contrary, the lowest subband (B) is even denser than that of speech signals. Therefore, the downward shift, i.e., leveraging the separation result obtained from higher subbands to assist in the separation of lower subbands, is indispensable when using SS-ILRMA for music signals.

5 Conclusion

In this paper, we proposed subband splitting, a technique for improving the performance of existing BSS algorithms. It sequentially applies a BSS algorithm to several overlapping subbands. Despite its simplicity, the proposed method effectively improves the separation performance. Additionally, we proposed SS-IVA and SS-ILRMA by incorporating IVA and ILRMA into our technique. Experimental results showed that our downward SS-ILRMA reached significantly higher separation performance with rapid convergence. Future work includes integrating more advanced BSS methods (e.g., those utilizing deep learning) for further improvement of separation performance. It is also valuable to extend the proposed technique to real-time processing scenarios by leveraging its fast convergence.

References

  • [1] P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Neurocomputing, 22(1), 21–34 (1998).
  • [2] N. Murata, S. Ikeda and A. Ziehe, “An approach to blind source separation based on temporal structure of speech signals,” Neurocomputing, 41(1), 1–24 (2001).
  • [3] H. Sawada, R. Mukai, S. Araki and S. Makino, “A robust and precise method for solving the permutation problem of frequency-domain blind source separation,” IEEE Trans. Speech Audio Process., 12(5), 530–538 (2004).
  • [4] L. Wang, H. Ding and F. Yin, “A region-growing permutation alignment approach in frequency-domain blind source separation of speech mixtures,” IEEE Trans. Audio Speech Lang. Process., 19(3), 549–557 (2011).
  • [5] A. Sarmiento, I. Durán-DÃaz, A. Cichocki and S. Cruces, “A contrast function based on generalized divergences for solving the permutation problem in convolved speech mixtures,” IEEE/ACM Trans. Audio Speech Lang. Process., 23(11), 1713–1726 (2015).
  • [6] S. Yamaji and D. Kitamura, “DNN-based permutation solver for frequency-domain independent component analysis in two-source mixture case,” Asia-Pac. Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA ASC), pp. 781–787 (2020).
  • [7] F. Hasuike, D. Kitamura and R. Watanabe, “DNN-based frequency-domain permutation solver for multichannel audio source separation,” Asia-Pac. Signal Inf. Process. Assoc. Annu. Summit Conf. (APSIPA ASC), pp. 871–876 (2022).
  • [8] L. Li, H. Kameoka and S. Seki, “HBP: An efficient block permutation solver using Hungarian algorithm and spectrogram inpainting for multichannel audio source separation,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 516–520, IEEE (2022).
  • [9] S. Emura, “Permutation-alignment method using manifold optimization for frequency-domain blind source separation,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 601–605 (2024).
  • [10] A. Hiroe, “Solution of permutation problem in frequency domain ICA, using multivariate probability density functions,” Independent Component Analysis and Blind Signal Separation, pp. 601–608 (2006).
  • [11] T. Kim, I. Lee and T.-W. Lee, “Independent vector analysis: Definition and algorithms,” Fortieth Asilomar Conf. Signals Syst. Comput., pp. 1393–1396 (2006).
  • [12] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), pp. 189–192 (2011).
  • [13] R. Scheibler and N. Ono, “Fast and stable blind source separation with rank-1 updates,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 236–240 (2020).
  • [14] K. Yatabe and D. Kitamura, “Time-frequency-masking-based determined BSS with application to sparse IVA,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 715–719 (2019).
  • [15] K. Yatabe and D. Kitamura, “Determined BSS based on time-frequency masking and its application to harmonic vector analysis,” IEEE/ACM Trans. Audio Speech Lang. Process., 29, 1609–1625 (2021).
  • [16] Y. Liang, S. M. Naqvi and J. Chambers, “Overcoming block permutation problem in frequency domain blind source separation when using AuxIVA algorithm,” Electron. Lett., 48(8), 460 – 462 (2012).
  • [17] G.-J. Jang, I. Lee and T.-W. Lee, “Independent vector analysis using non-spherical joint densities for the separation of speech signals,” IEEE Int. Conf. Acoust. Speech Signal Process., Vol. 2, pp. II–629–II–632 (2007).
  • [18] I. Lee and G.-J. Jang, “Independent vector analysis based on overlapped cliques of variable width for frequency-domain blind signal separation,” EURASIP J. Adv. Signal Process., 2012(1), 113 (2012).
  • [19] R. Ikeshita, Y. Kawaguchi, M. Togami, Y. Fujita and K. Nagamatsu, “Independent vector analysis with frequency range division and prior switching,” Eur. Signal Process. Conf. (EUSIPCO), pp. 2329–2333 (2017).
  • [20] U.-H. Shin and H.-M. Park, “Auxiliary-function-based independent vector analysis using generalized inter-clique dependence source models with clique variance estimation,” IEEE Access, 8, 68103–68113 (2020).
  • [21] L. Li and K. Koishida, “Geometrically constrained independent vector analysis for directional speech enhancement,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 846–850 (2020).
  • [22] K. Goto, T. Ueda, L. Li, T. Yamada and S. Makino, “Geometrically constrained independent vector analysis with auxiliary function approach and iterative source steering,” Eur. Signal Process. Conf. (EUSIPCO), pp. 757–761 (2022).
  • [23] D. Kitamura, N. Ono, H. Sawada, H. Kameoka and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Trans. Audio Speech Lang. Process., 24(9), 1626–1641 (2016).
  • [24] H. Sawada, N. Ono, H. Kameoka, D. Kitamura and H. Saruwatari, “A review of blind source separation methods: Two converging routes to ILRMA originating from ICA and NMF,” APSIPA Trans. Signal Inf. Process., 8(1), (2019).
  • [25] D. Kitamura and K. Yatabe, “Consistent independent low-rank matrix analysis for determined blind source separation,” EURASIP J. Adv. Signal Process., 2020(1), 46 (2020).
  • [26] S. Mogami, D. Kitamura, Y. Mitsui, N. Takamune, H. Saruwatari and N. Ono, “Independent low-rank matrix analysis based on complex Student’s t-distribution for blind audio source separation,” IEEE Int. Workshop Mach. Learn. Signal Process. (MLSP), pp. 1–6 (2017).
  • [27] S. Mogami, Y. Mitsui, N. Takamune, D. Kitamura, H. Saruwatari, Y. Takahashi, K. Kondo, H. Nakajima and H. Kameoka, “Independent low-rank matrix analysis based on generalized Kullback-Leibler divergence,” IEICE Trans. Fundam. Electron. Commun. Comput. Sci., E102.A(2), 458–463 (2019).
  • [28] Y. Mitsui, N. Takamune, D. Kitamura, H. Saruwatari, Y. Takahashi and K. Kondo, “Vectorwise coordinate descent algorithm for spatially regularized independent low-rank matrix analysis,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 746–750 (2018).
  • [29] F. Oshima, M. Nakano and D. Kitamura, “Interactive speech source separation based on independent low-rank matrix analysis,” Acoust. Sci. Technol., 42(4), 222–225 (2021).
  • [30] R. Ikeshita and Y. Kawaguchi, “Independent low-rank matrix analysis based on multivariate complex exponential power distribution,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 741–745 (2018).
  • [31] N. Makishima, S. Mogami, N. Takamune, D. Kitamura, H. Sumino, S. Takamichi, H. Saruwatari and N. Ono, “Independent deeply learned matrix analysis for determined audio source separation,” IEEE/ACM Trans. Audio Speech Lang. Process., 27(10), 1601–1615 (2019).
  • [32] T. Hasumi, T. Nakamura, N. Takamune, H. Saruwatari, D. Kitamura, Y. Takahashi and K. Kondo, “Empirical Bayesian independent deeply learned matrix analysis for multichannel audio source separation,” Eur. Signal Process. Conf. (EUSIPCO), pp. 331–335 (2021).
  • [33] T. Hasumi, T. Nakamura, N. Takamune, H. Saruwatari, D. Kitamura, Y. Takahashi and K. Kondo, “PoP-IDLMA: Product-of-prior independent deeply learned matrix analysis for multichannel music source separation,” IEEE/ACM Trans. Audio Speech Lang. Process., 31, 2680–2694 (2023).
  • [34] H. Kameoka, L. Li, S. Inoue and S. Makino, “Supervised determined source separation with multichannel variational autoencoder,” Neural Comput., 31(9), 1891–1914 (2019).
  • [35] L. Li, H. Kameoka and S. Makino, “Fast MVAE: Joint separation and classification of mixed sources based on multichannel variational autoencoder with auxiliary classifier,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 546–550 (2019).
  • [36] L. Li, H. Kameoka and S. Makino, “FastMVAE2: On improving and accelerating the fast variational autoencoder-based source separation algorithm for determined mixtures,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., 31, 96–110 (2022).
  • [37] R. Scheibler and M. Togami, “Surrogate source model learning for determined source separation,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 176–180 (2021).
  • [38] K. Matsumoto and K. Yatabe, “Determined BSS by combination of IVA and DNN via proximal average,” IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 871–875 (2024).
  • [39] K. Matsumoto and K. Yatabe, “Subband splitting: Simple, efficient and effective technique for solving block permutation problem in determined blind source separation,” arXiv:2409.09294 (2024).
  • [40] S. Araki, F. Nesta, E. Vincent, Z. Koldovský, G. Nolte, A. Ziehe and A. Benichoux, “The 2011 signal separation evaluation campaign (SiSEC2011): - audio source separation -,” Latent Var. Anal. Signal Sep., pp. 414–422 (2012).
  • [41] E. Hadad, F. Heese, P. Vary and S. Gannot, “Multichannel audio database in various acoustic environments,” Int. Workshop Acoust. Signal Enhanc. (IWAENC), pp. 313–317 (2014).
  • [42] C. Févotte, N. Bertin and J.-L. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural Comput., 21(3), 793–830 (2009).
  • [43] E. Vincent, R. Gribonval and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Trans. Audio Speech Lang. Process., 14(4), 1462–1469 (2006).
\profile

Kazuki Matsumotois currently an undergraduate student at Waseda University. His research interests include optimization-based signal processing.

\profile

Kohei Yatabereceived the B.E., M.E., and Ph.D. degrees from Waseda University in 2012, 2014, and 2017, respectively. He is currently an Associate Professor at the Department of Electrical Engineering and Computer Science at the Tokyo University of Agriculture and Technology.