A Perceptual Weighting Filter Loss
for DNN Training in Speech Enhancement

Abstract

Single-channel speech enhancement with deep neural networks (DNNs) has shown promising performance and is thus intensively being studied. In this paper, instead of applying the mean squared error (MSE) as the loss function during DNN training for speech enhancement, we design a perceptual weighting filter loss motivated by the weighting filter as it is employed in analysis-by-synthesis speech coding, e.g., in code-excited linear prediction (CELP). The experimental results show that the proposed simple loss function improves the speech enhancement performance compared to a reference DNN with MSE loss in terms of perceptual quality and noise attenuation. The proposed loss function can be advantageously applied to an existing DNN-based speech enhancement system, without modification of the DNN topology for speech enhancement.

Index Terms— Speech enhancement, deep neural networks, speech coding

1 Introduction

Single-channel speech enhancement aims to improve the quality and intelligibility of a speech signal degraded by additive noise, where the noisy mixture signal from only one microphone is available. As a widely researched problem, quite a number of contributions have been made over past decades, including conventional and data-driven speech enhancement approaches. Data-driven approaches have shown promising performance under non-stationary noise conditions, and they receive increasing research attention [1]. Therein, a regression for the spectral weighting rule indexed by the a posteriori signal-to-noise ratio (SNR) and a priori SNR for each frequency bin is trained and shows superior performance compared to conventional approaches [2, 3]. As deep neural networks (DNNs) provide an effective way for supervised learning, DNN-based speech enhancement is intensively studied [4, 5, 6, 7]. DNNs are trained to classify the noisy mixture signal into speech or noise for each frequency bin, known as ideal binary mask (IBM) [4]. Furthermore, some ratio masks are proposed as the targets for DNN training, e.g., the ideal ratio mask (IRM) [5] and fast Fourier transform (FFT) mask [6]. Those masks are usually trained directly by minimizing the error between the estimated mask and the oracle mask, which serves as a target. However, it is also possible to indirectly train the mask by introducing a multiplicative layer for the microphone signal during training, which allows to optimize the loss between clean speech targets and estimated clean speech. This has shown superior performance [8]. Besides, DNNs are used to directly estimate the clean speech from the noisy speech by a regression [7].

Loss functions play a key role in DNN training for speech enhancement, and the mean squared error (MSE) is a straightforward choice. The MSE loss is mostly applied in the same domain as the network’s input [7], but it can be also computed in a different domain to exploit additional domain knowledge [9]. Instead of using the MSE loss, some other loss functions are designed to optimize the perceptual metrics during the training, e.g., short-time objective intelligibility (STOI) [10] and/or perceptual evaluation of speech quality (PESQ) [11] are taken into account as loss functions. However, since the computation of both metrics is non-differentiable, some approximations have to be introduced in order to obtain fully differentiable loss functions for backpropagation [12, 13, 14, 15, 16]. As an alternative, in [17], the policy gradient method is applied on the basis of a sampling algorithm, leading to signals which achieve higher metric scores. Furthermore, in [18], the estimation of a metric score can be obtained by the discriminator of a generative adversarial network, and the generator then uses the prediction to decide for a gradient direction for optimization. Also, some perceptual weighting rules are proposed in loss functions for neural network training to achieve a better perceptual quality, e.g., high-energy frequency areas are emphasized in the loss functions in [19, 20]. However, a perceptual model is not considered in [19] and the loss function proposed in [20] contains an empirical function to balance boosting high energy components and suppressing low energy components. Besides, the absolute threshold of hearing and masking principles from psychoacoustics [21] are applied to construct the loss function in [22], where a slight improvement over using the MSE loss is found at very low SNR levels. In [23], the authors propose a multi-target training that incorporates a perceptual weighting loss. However, the approach requires to change the network topology by adding additional output nodes, which results in a doubling of output nodes in [23]. This hinders the straightforward integration of the concept into existing frameworks as not only additional training targets have to be extracted, but also the topology has to be altered.

In this paper, two perceptual weighting filters from code-excited linear predictive (CELP) speech coding, as in, e.g., adaptive multi-rate (AMR) [24], wideband AMR (AMR-WB) [25], and enhanced voice services (EVS) [26], are applied to design the loss function for DNN training in speech enhancement. The perceptual weighting filter is originally used to shape the coding noise / quantization error to be less audible by exploiting the masking property of the human ear [27]. Motivated by the success of the weighting filter when used to shape coding noise, we apply it in this work to design the loss function to achieve improved perceptual quality in the context of speech enhancement, aiming to combat acoustic background noise (instead of coding noise, as has been done in [28]). We propose to extract the frequency response of the weighting filter from the clean speech and directly apply it to the loss function, resulting in an unaltered DNN architecture with no need to change the topology as in [23]. This results in an approach which is easy to integrate into existing frameworks.

The paper is structured as follows: In Section 2 we briefly introduce the reference DNN-based speech enhancement with MSE loss. In Section 3 we describe the adopted weighting filter and how it is applied to the loss function in DNN training. Section 4 presents the experimental setup, the evaluation results, and the discussion. Finally, some conclusions are drawn in Section 5.

2 Reference DNN with MSE LOSS

\psfrag{A}[cc][cc]{\footnotesize$S_{\ell}\!(k)$}\psfrag{B}[cc][cc]{\footnotesize$D_{\ell}\!(k)$}\psfrag{C}[cc][cc]{\footnotesize$Y_{\ell}\!(k)$}\psfrag{D}[cc][cc]{\footnotesize$|Y_{\ell}\!(k)|$}\psfrag{E}[cc][cl]{\footnotesize$[|Y_{\ell\!-\!2}\!(k)|,\!\ldots,\!|Y_{\ell\!+\!2}\!(k)|]$}\psfrag{F}[cc][cc]{\footnotesize$|M_{\ell}\!(k)|$}\psfrag{G}[cc][cc]{\footnotesize$\widehat{|S_{\ell}\!(k)|}$}\psfrag{H}[cc][cc]{\footnotesize$E_{\ell}\!(k)$}\psfrag{I}[cc][cc]{\footnotesize$E^{\text{w}}_{\ell}\!(k)$}\psfrag{J}[cc][cc]{\footnotesize$|S_{\ell}\!(k)|$}\psfrag{K}[cc][cc]{\footnotesize$a_{\ell}\!(k)$}\psfrag{L}[cc][cc]{\footnotesize$W_{\ell}\!(z)$}\psfrag{M}[cc][cc]{\footnotesize$W_{\ell}\!(k)$}\psfrag{N}[cc][cc]{\footnotesize$|W\!_{\ell}\!(k)|$}\psfrag{O}[cc][cc]{\footnotesize Delay\& }\psfrag{P}[cc][cc]{\footnotesize Reshape}\psfrag{Q}[cc][cc]{\footnotesize DNN}\psfrag{R}[cc][cc]{\footnotesize Scaling}\psfrag{S}[cc][cc]{\footnotesize$\left\|\cdot\right\|^{2}_{2}$}\psfrag{T}[cc][cc]{\footnotesize min}\psfrag{U}[cb][rb]{\footnotesize MSE}\psfrag{z}[cc][rb]{\footnotesize optimization}\psfrag{V}[cc][cc]{\footnotesize LP}\psfrag{W}[cc][cc]{\footnotesize Analysis}\psfrag{X}[cc][cc]{\footnotesize Weighting}\psfrag{Y}[cc][cc]{\footnotesize Filter}\psfrag{Z}[cc][cc]{\footnotesize$z\!\!\rightarrow\!\!e^{j\!\frac{2\pi k}{K}}$}\psfrag{a}[cc][cc]{\footnotesize Weighting filter amplitude response}\psfrag{b}[cc][cc]{\footnotesize$\gamma_{1}$}\psfrag{c}[cc][cc]{\footnotesize$\gamma_{2}$}\psfrag{w}[cc][cc]{\footnotesize Enhanced speech spectral amplitudes}\psfrag{d}[cc][cc]{\footnotesize$s_{\ell}(n)$}\psfrag{e}[cc][cc]{\footnotesize FFT}\includegraphics[width=243.09818pt]{weighting_filter_loss_maskDNN_training_new_nol2norm.eps}

Figure 1: Training stage for DNN-based speech enhancement with perceptual weighting filter loss. For the reference with MSE loss just omit the lower branch and set

|W_{\ell}|\!=\!1

In this paper, the single-channel noisy mixture signal is modelled as $y(n)\!=\!s(n)\!+\!d(n)$ , with $s(n)$ being the clean speech signal, $d(n)$ being the additive noise, and $n$ being the discrete-time sample index. By applying a $K$ -point FFT, the frequency domain analogue is $Y_{\ell}(k)\!=\!S_{\ell}(k)\!+\!D_{\ell}(k)$ , with $\ell$ being the frame index and $k\!\in\left\{0,\!1,\!\ldots\!,\!K\!-\!1\right\}$ being the frequency bin index. The frames are assembled by applying the periodic Hann window function with 16 ms frame length and 50% overlap.

The reference DNN trained with MSE loss and the proposed approach using the weighting filter loss share the same enhancement structure apart from the loss function. The shared structure is shown in the upper grey-shaded part of Figure 1 (denoted as “Enhanced speech spectral amplitudes”). The input of the DNN are the spectral amplitudes of the current noisy signal’s frame $|Y_{\ell}(k)|$ along with the spectral amplitudes of two left and two right context frames, resulting in the number of DNN input nodes being $5\!\times\!129\!=\!645$ . Note that only the first $K\!/2\!+\!1=\!129$ spectral amplitude values per frame are to be enhanced with $K\!=\!256$ , since the other values can be obtained symmetrically. Then, the mask $|M_{\ell}(k)|$ , implicitly predicted by the DNN, is multiplied to the noisy spectral amplitudes $|Y_{\ell}(k)|$ to generate the enhanced speech spectral amplitudes

\widehat{|S_{\ell}(k)|}=|Y_{\ell}(k)|\cdot|M_{\ell}(k)|.\vskip-2.84544pt

(1)

It is worth noting that the targets for DNN training are the clean speech spectral amplitudes $|S_{\ell}(k)|$ . Finally, the enhanced speech signal $\hat{s}(n)$ is obtained by performing the inverse FFT together with the noisy phase from $Y_{\ell}(k)$ and overlap-add synthesis.

The DNN topology used in this paper is shown in Figure 2. The DNN contains five hidden layers, which are fully-connected layers, followed by batch normalization, leaky rectified linear unit (ReLU) activation functions, and dropout with a dropout rate of 0.2. In addition, skip connections are used to add up the layer outputs with matching dimensions, in order to ease the vanishing gradient problem during training [29]. The output layer, implicitly predicting the mask spectral amplitudes $|M_{\ell}(k)|$ , has $K\!/2\!+\!1=\!129$ nodes with sigmoid activation functions to limit the output to $0\!\leq\!|M_{\ell}(k)|\!\leq\!1$ .

\psfrag{a}[cc][cc]{\footnotesize$[\,|Y_{\ell\!-\!2}\!(k)|,$}\psfrag{z}[cc][cc]{\footnotesize$\ldots,$}\psfrag{y}[cc][cc]{\footnotesize$|Y_{\ell\!+\!2}\!(k)|\,]$}\psfrag{b}[cc][cc]{\footnotesize\rotatebox{90.0}{FC(1024)}}\psfrag{c}[cc][cc]{\footnotesize\rotatebox{90.0}{FC(512)}}\psfrag{d}[cc][cc]{\footnotesize\rotatebox{90.0}{FC(512)}}\psfrag{e}[cc][cc]{\footnotesize\rotatebox{90.0}{FC(512)}}\psfrag{f}[cc][cc]{\footnotesize\rotatebox{90.0}{FC(256)}}\psfrag{g}[cc][cc]{\footnotesize Output}\psfrag{i}[cc][cc]{\footnotesize Layer}\psfrag{j}[cc][cc]{\footnotesize$|M_{\ell}(k)|$}\psfrag{h}[cc][cc]{\footnotesize Hidden layers}\psfrag{k}[cr][cr]{\footnotesize DNN}\includegraphics[width=240.57114pt]{dnn_topolo.eps}

Figure 2: Detailed view of the DNN: The operation FC(

N

) stands for a fully-connected layer with

N

nodes, followed by batch normalization, leaky ReLU activation functions, and dropout. The output layer contains batch normalization, a fully-connected layer with 129 nodes, and sigmoid activation functions.

3 Perceptual Weighting Filter Loss

3.1 Perceptual Weighting Filter in CELP Speech Coding

In AMR speech coding, the perceptual weighting filter, employed in the analysis-by-synthesis search of the codebooks, is expressed according to [24] as

W_{\ell}(z)=\frac{1-A_{\ell}(z/\gamma_{1})}{1-A_{\ell}(z/\gamma_{2})},\vskip-5.69046pt

(2)

where $A_{\ell}(z/\gamma)\!=\!\sum_{i=1}^{N_{p}}\!a_{\ell}(i)\gamma^{i}z^{-i}$ , $a_{\ell}(i)$ are the linear prediction (LP) coefficients of frame $\ell$ , $N_{p}$ is the prediction order, and $\gamma_{1}$ , $\gamma_{2}$ are the perceptual weighting factors. As the search of the codebooks is done by minimizing the weighted error between the clean speech and the coded speech, the weighted error becomes spectrally white, meaning that the final (unweighted) coding error is proportional to the inverse weighting filter $1/W_{\ell}(z)$ . This inverse weighting filter has similarities to the structure of the clean speech spectral envelope which exploits the masking property of the human ear: More energy of the quantization error will be in the speech formant regions, as $1/W_{\ell}(z)$ is somewhat below the spectral envelope there.

A variant of the above weighting filter, originally used in AMR-WB (also used in the recent EVS codec) [25, 26] is proposed as

W^{\prime}_{\ell}(z)=1-A^{\prime}_{\ell}(z/\gamma_{1}),\vskip-2.84544pt

(3)

where the LP coefficients $a^{\prime}_{\ell}(i)$ are computed based on the speech being preemphasized by a filter $H_{\text{pre}}(z)\!=\!1\!-\!\beta z^{-1}$ . Note that the weighting filter in the codecs is originally of the form $\big{(}1\!-\!A^{\prime}_{\ell}(z/\gamma_{1})\big{)}/H_{\text{pre}}(z)$ and a de-emphasis filter $H_{\text{pre}}^{-1}(z)$ is applied in the decoder. Therefore, the quantization error is actually proportional to $1/\big{(}1\!-\!A^{\prime}_{\ell}(z/\gamma_{1})\big{)}$ [25], resulting in the equivalent weighting filter (3). Both filter types being employed as potential weighting filter losses will be evaluated and discussed in the next section.

To intuitively understand this shaping effect, we exemplarily show the clean speech spectral envelope, the pedestrian (PED) noise amplitude spectrum, and the noise shaped by the inverse weighting filter (2) for an example frame in Figure 3. This example frame is from the speech file bbaf1s of the Grid corpus [30] downsampled to 16 kHz, the pedestrian noise is from the CHiME-3 dataset [31], and the SNR is 5 dB. The AMR weighting filter (2) is used in this example with $N_{p}\!=\!16$ , $\gamma_{1}\!=\!0.92$ , and $\gamma_{2}\!=\!0.6$ . It shows nicely, how the noise is shaped to be “hidden” under the speech spectral envelope, so that the shaped noise will be less perceivable by the human ear.

\psfrag{0}[tl][tl]{\footnotesize$0$}\psfrag{16}[tc][tc]{\footnotesize$16$}\psfrag{32}[tc][tc]{\footnotesize$32$}\psfrag{48}[tc][tc]{\footnotesize$48$}\psfrag{64}[tc][tc]{\footnotesize$64$}\psfrag{80}[tc][tc]{\footnotesize$80$}\psfrag{96}[tc][tc]{\footnotesize$96$}\psfrag{112}[tc][tc]{\footnotesize$112$}\psfrag{128}[tc][tc]{\footnotesize$128$}\psfrag{45}[cr][cr]{\footnotesize$45$}\psfrag{20}[cr][cr]{\footnotesize$20$}\psfrag{30}[cr][cr]{\footnotesize$30$}\psfrag{40}[cr][cr]{\footnotesize$40$}\psfrag{50}[cr][cr]{\footnotesize$50$}\psfrag{55}[cr][tr]{\footnotesize$55$}\psfrag{25}[cr][cr]{\footnotesize$25$}\psfrag{35}[cr][cr]{\footnotesize$35$}\psfrag{ylabeltext}[bc][tc]{\footnotesize$10$log${}_{10}(\cdot)$ [dB]}\psfrag{Frequencybinsk}[tc][bc]{\footnotesize Frequency bin $k$}\includegraphics[width=227.90501pt]{fig_invers_wf.eps}

Figure 3: Clean speech spectral envelope, pedestrian (PED) noise amplitude spectrum, and the inverse weighted PED noise amplitude spectrum for an example frame.

3.2 Perceptual Weighting Filter Loss in DNN Training

Now that the perceptual weighting filter from CELP speech coding has been revisited, we show how a weighting filter loss is constructed for DNN training in Figure 1. We derive the loss from the AMR weighting filter, while the AMR-WB weighting filter is obtained straightforward. First, LP analysis is performed based on the clean speech frame $s_{\ell}(n)$ to obtain the frame-wise LP coefficients $a_{\ell}(i)$ . Then, the weighting filter $W_{\ell}(z)$ is calculated by (2) and the corresponding frequency amplitude response is obtained as

W_{\ell}(k)=W_{\ell}(z)\!\Bigm{|}_{z=e^{j\!\frac{2\pi k}{K}}},\ \ \ \ \ k\in\{0,1,\ldots,K/2\}.\vskip-2.84544pt

(4)

Finally, the loss function is obtained as

J_{\ell}=\big{(}E^{\text{w}}_{\ell}(0)\big{)}^{2}+\big{(}E^{\text{w}}_{\ell}(\tfrac{K}{2})\big{)}^{2}+2\cdot\!\!\sum_{k=1}^{K\!/2\!-\!1}\big{(}E_{\ell}^{\text{w}}(k)\big{)}^{2},\vskip-5.69046pt

(5)

where $E_{\ell}^{\text{w}}(k)\!=\!|W_{\ell}(k)|\cdot E_{\ell}(k)$ is the weighted error, and $E_{\ell}(k)\!=\!|S_{\ell}(k)|-\widehat{|S_{\ell}(k)|}$ is the training error. As $E_{\ell}^{\text{w}}(k)$ becomes spectrally white, we find that $E_{\ell}(k)\!\sim\!1/|W_{\ell}(k)|$ . As a result, the residual noise is expected to be less audible. It is worth noting that in the enhancement stage, the inference process of this DNN is the same as for the reference DNN.

4 Experimental Evaluation

4.1 Experimental Setup and Evaluation Metrics

The speech material used in this paper is from the Grid corpus [30] and is downsampled to 16 kHz. We randomly select 8 female and 8 male speakers, and each speaker includes 100 sentences with a duration of 3 seconds per sentence. Thus, the total amount of clean speech used for training and validation is 80 minutes long. Another 2 female and 2 male speakers with 20 sentences each are used for testing. The additive background noise is from the CHiME-3 dataset: Pedestrian noise (PED), café noise (CAF), and street noise (STR) are used for the training, validation, and test, while bus noise (BUS) is used additionally as an unseen noise type for testing. Training and validation material contains the noisy data at six SNR levels (from -5 dB to 20 dB with 5 dB step size), and these two sets are obtained by a split with a ratio of 4 $:$ 1 with a proportional number of sentences covered by all the speakers, noise types and SNR levels. Note that the noise material used in the test set is also disjoint from that for training and validation.

Regarding the DNN training, the input is first normalized to have zero mean and unit variance based on the statistics of the training data. Then, the weights of the DNN are trained by minimizing the loss function, applying the Adam algorithm with the learning rate being $5\!\cdot\!10^{-4}$ . The weights are updated after each minibatch containing 128 input and target pairs, which are randomly selected from the training data.

\psfrag{01}[cc][tc]{\scriptsize$1$ }\psfrag{0098}[cc][tc]{\scriptsize$0.98$}\psfrag{0096}[cc][tc]{\scriptsize$0.96$}\psfrag{0094}[cc][tc]{\scriptsize$0.94$}\psfrag{0092}[cc][tc]{\scriptsize$0.92$}\psfrag{009}[cc][tc]{\scriptsize$0.9$}\psfrag{0088}[cc][tc]{\scriptsize$0.88$}\psfrag{0086}[cc][tc]{\scriptsize$0.86$}\psfrag{0084}[cc][tc]{\scriptsize$0.84$}\psfrag{0082}[cc][tc]{\scriptsize$0.82$}\psfrag{008}[cc][tc]{\scriptsize$0.8$}\psfrag{0078}[cc][tc]{\scriptsize$0.78$}\psfrag{0076}[cc][tc]{\scriptsize$0.76$}\psfrag{0074}[cc][tc]{\scriptsize$0.74$}\psfrag{0072}[cc][tc]{\scriptsize$0.72$}\psfrag{007}[cc][tc]{\scriptsize$0.7$}\psfrag{2.1}[cl][tl]{\footnotesize$2.1$}\psfrag{2.15}[cr][cr]{\footnotesize$2.15$}\psfrag{2.2}[cr][cr]{\footnotesize$2.2$}\psfrag{2.25}[cr][cr]{\footnotesize$2.25$}\psfrag{2.3}[cr][cr]{\footnotesize$2.3$}\psfrag{2.35}[cr][cr]{\footnotesize$2.35$}\psfrag{2.4}[cr][cr]{\footnotesize$2.4$}\psfrag{2.45}[cr][cr]{\footnotesize$2.45$}\psfrag{2.5}[cr][cr]{\footnotesize$2.5$}\psfrag{2.55}[cr][cr]{\footnotesize$2.55$}\psfrag{ylabeltext}[bc][tc]{\footnotesize PESQ}\psfrag{BaselineDNNNN}[cc][cc]{\footnotesize Reference DNN}\psfrag{AMR}[cc][cc]{\footnotesize AMR}\psfrag{AMRWB}[cc][cc]{\footnotesize AMR-WB}\psfrag{ggaamma1}[cc][cc]{\footnotesize$\gamma_{1}$}\psfrag{ggaamma12}[cc][cc]{\footnotesize$\gamma_{1}$}\includegraphics[width=227.90501pt]{fig_elav_data_new.eps}

Figure 4: PESQ performance for the proposed loss functions applying the weighting filters from AMR and AMR-WB, with the perceptual weighting factor

\gamma_{1}\!\in\!\{1,\!0.98,\!\ldots,\!0.7\}

\gamma_{2}\!=\!0.6

(AMR), and the preemphasis filter factor

\beta\!=\!0.68

(AMR-WB) on the development dataset. The optimal setting is marked with

\bm{\ast}

In order to evaluate the speech enhancement system including both, speech and noise components, PESQ [32] and STOI [10] are used to measure the quality and the intelligibility of the enhanced speech. Also, SNR improvement (SNRI) [33] is used to measure the SNR improvement achieved by the speech enhancement system.

Regarding the evaluation metric for the speech component, segmental speech-to-speech-distortion ratio (SSDR) is calculated after [34] as

\text{SSDR}=\frac{1}{\left|\mathcal{L}\right|}\sum_{\ell\in\mathcal{L}}\text{SSDR}(\ell),\vskip-2.84544pt

(6)

where $\mathcal{L}$ is the set of frame indices for speech active frames and $\text{SSDR}(\ell)$ is limited from $R_{\text{min}}\!=\!-10$ dB to $R_{\text{max}}\!=\!30$ dB by $\text{SSDR}(\ell)\!=\!\text{max}\left\{\text{min}\left\{\text{SSDR}^{\prime}(\ell),\!R_{\text{max}}\right\}\!,\!R_{\text{min}}\right\}$ . The term $\text{SSDR}^{\prime}(\ell)$ is actually calculated as

\text{SSDR}^{\prime}(\ell)=10\text{log}_{10}\left[\frac{\sum_{n\in\mathcal{N}_{\ell}}s(n)^{2}}{\sum_{n\in\mathcal{N}_{\ell}}\left(\tilde{s}(n)-s(n)\right)^{2}}\right]\;[\text{dB}],\vskip-2.84544pt

(7)

where $\mathcal{N}_{\ell}$ is the set of sample indices $n$ belonging to frame $\ell$ , $s(n)$ is the clean speech, and $\tilde{s}(n)$ is the time-aligned filtered speech component. Note that this filtered speech component $\tilde{s}(n)$ and the following filtered noise component $\tilde{d}(n)$ are obtained on the basis of (1), but replacing the noisy speech spectral amplitudes $|Y_{\ell}(k)|$ by either the speech component spectral amplitudes $|S_{\ell}(k)|$ , or the noise component spectral amplitudes $|D_{\ell}(k)|$ , respectively [35].

Another evaluation metric we use in this paper to measure the SNR improvement is $\Delta\text{SNR}\!=\!\text{SNR}_{\text{out}}\!-\!\text{SNR}_{\text{in}}$ , where $\text{SNR}_{\text{in}}$ is the SNR level for the noisy mixture $y(n)$ , and $\text{SNR}_{\text{out}}$ is calculated based on $\tilde{s}(n)$ and $\tilde{d}(n)$ .

4.2 Experimental Results and Discussion

Noise	Method	Noise	Speech	Total
Type		Component	Component
		$\Delta$ SNR [dB]	SSDR [dB]	PESQ	STOI	SNRI [dB]
PED	Noisy	-	-	1.98	0.66	-
	Reference DNN	4.78	17.34	2.50	0.69	10.56
	Weighted DNN	5.55	15.57	2.60	0.69	14.48
CAF	Noisy	-	-	1.99	0.64	-
	Reference DNN	4.84	17.43	2.50	0.68	11.04
	Weighted DNN	5.48	15.79	2.57	0.68	15.82
STR	Noisy	-	-	1.97	0.69	-
	Reference DNN	5.77	18.54	2.52	0.72	12.11
	Weighted DNN	6.41	16.99	2.63	0.72	16.47
BUS	Noisy	-	-	2.19	0.72	-
(un-	Reference DNN	4.03	20.85	2.65	0.74	6.08
seen)	Weighted DNN	4.93	19.76	2.70	0.74	9.71

Table 1: Evaluation metrics for the proposed approach (denoted as weighted DNN) and the reference DNN approach under different noise conditions.

\Delta

SNR, SSDR, and SNRI are measured in dB. The better approach is in boldface.

We first search for the optimal parameter $\gamma_{1}$ of the perceptual weighting filter applied to the loss function by conducting experiments on a development dataset. This development dataset is a subset of the validation data, which uses a quarter of the data from the validation dataset covering all speakers, noise types, and SNR levels. It is used to decrease the amount of data for optimal parameter search, thus improving efficiency. As shown in Figure 4, the two abovementioned perceptual weighting filters (see (2) and (3), with $N_{p}\!=\!16$ ) are evaluated with various perceptual weighting factors. Regarding the weighting filter from AMR, we search the optimal $\gamma_{1}$ , as different $\gamma_{1}$ values are also applied for different bitrates in the AMR codec [24]. In the meanwhile, we keep $\gamma_{2}\!=\!0.6$ unchanged because an informal search on $\gamma_{2}$ with limited values shows that the performance has a rather weak dependency on $\gamma_{2}$ . The same search range of $\gamma_{1}$ is also investigated for the weighting filter from AMR-WB, and we keep the preemphasis filter factor $\beta\!=\!0.68$ as in [25]. The factor combinations actually define the spectral shape and the spectral tilt of the weighting filter. It can be seen that under the investigated parameters, the weighting filters with various $\gamma_{1}$ from AMR shows generally better performance compared to those from AMR-WB. The weighting filters of AMR with $\gamma_{1}$ in the selected range (left part in Figure 4) can outperform the reference DNN except for some outlier values of $\gamma_{1}$ . The optimal settings are the weighting filter from AMR (2) with $\gamma_{1}\!=\!0.92,\,\gamma_{2}\!=\!0.6$ . The DNN model optimized by the loss function with the optimal weighting filter settings will be used in the following experiments (denoted as Weighted DNN).

In Table 1, we report the results on the test dataset and compare them to the reference DNN. The results are collected in four noise conditions and all values are averaged over six SNRs from -5 to 20 dB. We can see the PESQ score improvements from the noisy speech $y(n)$ to the enhanced speech $\hat{s}(n)$ for both approaches. Compared to the reference DNN, the proposed method achieves improved PESQ performance in the range of 0.07…0.11 points for the three seen noise types (PED, CAF, and STR), and 0.05 points improvement for the unseen BUS noise condition. The STOI measures are comparable for the two approaches. Regarding the SNR improvement, the proposed approach consistently outperforms the reference DNN in terms of SNRI and $\Delta$ SNR by more than 3.5 dB and 0.6 dB, respectively, for all four noise conditions. The results support the effectiveness of the proposed loss function, showing that the difference between the enhanced speech and the clean speech is less perceptually significant.

We notice that the SSDR is decreased by using the weighting filter loss function, which is easy to explain: The trained DNN with the weighting filter loss is biased to focus on the spectrum, i.e., more focus is put on the clean speech spectral valley regions and less on the formant regions. This effectively improves the perceptual quality of the enhanced speech as shown in the above PESQ scores, however, it introduces some measurable distortions to the speech component. This proves that the weighting filter loss does what it is expected to do: It does not excel the reference DNN in terms of MSE (or SSDR), but only subjectively (shown by PESQ).

We further show some results in various SNR levels in Figure 5, where all the values are averaged over the four noise types. The PESQ score improvement for the proposed approach over the reference DNN is more significant in higher SNR levels than in lower SNR levels. As already shown in Table 1, the STOI measures for the two approaches are quite comparable also across various SNR levels. Regarding the SNRI metric, the proposed method clearly excels the reference DNN in all SNR conditions, obtaining 4.17 dB SNRI improvement on average of SNR conditions and noise types.

\psfrag{AMR1}[cc][cl]{\scriptsize AMR, $\gamma_{1}\!=\!1$ }\psfrag{AMR255}[cc][cl]{\scriptsize AMR, $\gamma_{1}\!=\!0.98$}\psfrag{AMR355}[cc][cl]{\scriptsize AMR, $\gamma_{1}\!=\!0.96$}\psfrag{AMR455}[cc][cl]{\scriptsize AMR, $\gamma_{1}\!=\!0.94$}\psfrag{AMR555}[cc][cl]{\scriptsize AMR, $\gamma_{1}\!=\!0.92$}\psfrag{AMR65}[cc][cl]{\scriptsize AMR, $\gamma_{1}\!=\!0.9$}\psfrag{EVS1}[cc][cl]{\scriptsize EVS, $\gamma_{1}\!=\!1$}\psfrag{EVS255}[cc][cl]{\scriptsize EVS, $\gamma_{1}\!=\!0.98$}\psfrag{EVS355}[cc][cl]{\scriptsize EVS, $\gamma_{1}\!=\!0.96$}\psfrag{EVS455}[cc][cl]{\scriptsize EVS, $\gamma_{1}\!=\!0.94$}\psfrag{EVS555}[cc][cl]{\scriptsize EVS, $\gamma_{1}\!=\!0.92$}\psfrag{EVS65}[cc][cl]{\scriptsize EVS, $\gamma_{1}\!=\!0.9$}\psfrag{1}[tl][tr]{\footnotesize$-5$}\psfrag{2}[tc][tc]{\footnotesize$0$}\psfrag{3}[tc][tc]{\footnotesize$5$}\psfrag{4}[tc][tc]{\footnotesize$10$}\psfrag{5}[tc][tc]{\footnotesize$15$}\psfrag{6}[tc][tc]{\footnotesize$20$}\psfrag{a}[cr][cl]{\footnotesize$5$}\psfrag{10}[cr][cr]{\footnotesize$10$}\psfrag{15}[cr][cr]{\footnotesize$15$}\psfrag{20}[cr][cr]{\footnotesize$20$}\psfrag{0.5}[cr][cr]{\footnotesize$0.5$}\psfrag{0.6}[cr][cr]{\footnotesize$0.6$}\psfrag{0.7}[cr][cr]{\footnotesize$0.7$}\psfrag{0.8}[cr][cr]{\footnotesize$0.8$}\psfrag{0.9}[cr][cr]{\footnotesize$0.9$}\psfrag{1.5}[cr][cr]{\footnotesize$1.5$}\psfrag{b}[cr][cr]{\footnotesize$2$}\psfrag{2.5}[cr][cr]{\footnotesize$2.5$}\psfrag{c}[cr][cr]{\footnotesize$3$}\psfrag{3.5}[cr][cr]{\footnotesize$3.5$}\psfrag{d}[cr][cr]{\footnotesize$4$}\psfrag{PESQMOS}[bc][tc]{\footnotesize PESQ}\psfrag{STOI}[bc][tc]{\footnotesize STOI}\psfrag{SNRIdB}[bc][tc]{\footnotesize SNRI [dB]}\psfrag{SNRdB}[tc][bc]{\footnotesize SNR [dB]}\psfrag{WeightedDNNNNN}[cc][cc]{\footnotesize Weighted DNN}\psfrag{BaselineDNN}[cl][cl]{\footnotesize Reference DNN}\psfrag{PESQtext}[tl][tl]{\footnotesize PESQ (noisy)}\psfrag{STOItext}[tl][tl]{\footnotesize STOI (noisy)}\includegraphics[width=227.90501pt]{fig_test_data_curve.eps}

Figure 5: PESQ, STOI, and SNRI measures for various SNRs averaged over all four noise types of the proposed approach (denoted as weighted DNN) and the reference DNN approach.

5 Conclusions

In this paper, a perceptual weighting filter loss is designed for deep neural network (DNN) training in speech enhancement by applying the weighting filter from CELP speech coding. Simulation results show that the proposed approach outperforms the reference DNN trained with mean squared error loss in terms of speech quality measured by PESQ and significantly higher noise attenuation measured by more than 4 dB SNR improvement and 0.7 dB $\Delta$ SNR on average over SNR levels and noise types. The proposed loss function could be applied advantageously to an existing DNN-based speech enhancement system, without modification of the DNN topology and the speech enhancement framework. The source code for the proposed approach is available at https://github.com/ifnspaml/Perceptual-Weighting-Filter-Loss.

References

[1] S. Mirsamadi and I. Tashev, “Causal Speech Enhancement Combining Data-Driven Learning and Suppression Rule Estimation,” in Proc. of INTERSPEECH, San Francisco, CA, USA, Sept. 2016, pp. 2870–2874.
[2] T. Fingscheidt and S. Suhadi, “Data-Driven Speech Enhancement,” in Proc. of ITG-Fachtagung ”Sprachkommunikation”, Kiel, Germany, Apr. 2006, pp. 1–4.
[3] T. Fingscheidt, S. Suhadi, and S. Stan, “Environment-Optimized Speech Enhancement,” IEEE T-ASLP, vol. 16, no. 4, pp. 825–834, May 2008.
[4] Y. Wang and D. Wang, “Towards Scaling up Classification-Based Speech Separation,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 7, pp. 1381–1390, Mar. 2013.
[5] A. Narayanan and D. Wang, “Ideal Ratio Mask Estimation Using Deep Neural Networks for Robust Speech Recognition,” in Proc. of ICASSP, Vancouver, BC, Canada, May 2013, pp. 7092–7096.
[6] Y. Wang, A. Narayanan, and D. Wang, “On Training Targets for Supervised Speech Separation,” IEEE/ACM T-ASLP, vol. 22, no. 12, pp. 1849–1858, Dec. 2014.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A Regression Approach to Speech Enhancement Based on Deep Neural Networks,” IEEE/ACM T-ASLP, vol. 23, no. 1, pp. 7–19, Oct. 2014.
[8] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, “Phase-Sensitive and Recognition-Boosted Speech Separation Using Deep Recurrent Neural Networks,” in Proc. of ICASSP, Brisbane, QLD, Australia, Aug. 2015, pp. 708–712.
[9] A. Pandey and D. Wang, “A New Framework for Supervised Speech Enhancement in the Time Domain,” in Proc. of INTERSPEECH, Hyderabad, India, Sept. 2018, pp. 1136–1140.
[10] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech,” IEEE T-ASLP, vol. 19, no. 7, pp. 2125–2136, Sept. 2011.
[11] ITU, Rec. P.862: Perceptual Evaluation of Speech Quality (PESQ), International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), Feb. 2001.
[12] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai, “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE/ACM T-ASLP, vol. 26, no. 9, pp. 1570–1584, Sept. 2018.
[13] M. Kolbcek, Z.-H. Tan, and J. Jensen, “Monaural Speech Enhancement Using Deep Neural Networks by Maximizing A Short-Time Objective Intelligibility Measure,” in Proc. of ICASSP, Calgary, AB, Canada, Apr. 2018, pp. 5059–5063.
[14] Y. Zhao, B. Xu, R. Giri, and T. Zhang, “Perceptually Guided Speech Enhancement Using Deep Neural Networks,” in Proc. of ICASSP, Calgary, AB, Canada, Apr. 2018, pp. 5074–5078.
[15] J. M. Martín-Doñas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, “A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,” IEEE Signal Process. Lett., vol. 25, no. 11, pp. 1680–1684, 2018.
[16] H. Zhang, X. Zhang, and G. Gao, “Training Supervised Speech Separation System to Improve STOI and PESQ Directly,” in Proc. of ICASSP, Calgary, AB, Canada, Apr. 2018, pp. 5374–5378.
[17] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “DNN-Based Source Enhancement to Increase Objective Sound Quality Assessment Score,” IEEE/ACM T-ASLP, vol. 26, no. 10, pp. 1780–1792, 2018.
[18] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin, “MetricGAN: Generative Adversarial Networks Based Black-Box Metric Scores Optimization for Speech Enhancement,” in Proc. of ICML, Long Beach, CA, USA, June 2019, pp. 2031–2041.
[19] B. Xia and C. Bao, “Wiener Filtering Based Speech Enhancement With Weighted Denoising Auto-Encoder and Noise Classification,” Speech Communication, vol. 60, pp. 13–29, Feb. 2014.
[20] Q. Liu, W. Wang, P. J. Jackson, and Y. Tang, “A Perceptually-Weighted Deep Neural Network for Monaural Speech Enhancement in Various Background Noise Conditions,” in Proc. of EUSIPCO, Kos, Greece, Aug. 2017, pp. 1270–1274.
[21] T. Painter and A. Spanias, “A Review of Algorithms for Perceptual Coding of Digital Audio Signals,” in Proc. of International Conference on Digital Signal Processing, Santorini, Greece, July 1997, pp. 179–208.
[22] A. Kumar and D. Florencio, “Speech Enhancement in Multiple-Noise Conditions Using Deep Neural Networks,” in Proc. of INTERSPEECH, San Francisco, CA, USA, Sept. 2016, pp. 352–356.
[23] W. Han, X. Zhang, G. Min, X. Zhou, and W. Zhang, “Perceptual Weighting Deep Neural Networks for Single-Channel Speech Enhancement,” in Proc. of WCICA, Guilin, China, June 2016, pp. 446–450.
[24] 3GPP, Mandatory Speech Codec Speech Processing Functions; Adaptive Multi-Rate (AMR) Speech Codec; Transcoding Functions (3GPP TS 26.090, Rel. 14), 3GPP; TSG SA, Mar. 2017.
[25] ——, Speech Codec Speech Processing Functions; Adaptive Multi-Rate-Wideband (AMR-WB) Speech Codec; Transcoding Functions (3GPP TS 26.190, Rel. 14), 3GPP; TSG SA, Mar. 2017.
[26] ——, Codec for Enhanced Voice Services (EVS); Detailed Algorithmic Description (3GPP TS 26.445, Rel. 14), 3GPP; TSG SA, Mar. 2017.
[27] T. Bäckström, Speech Coding: Code Excited Linear Prediction. Springer, 2017.
[28] C. Brauer, Z. Zhao, D. Lorenz, and T. Fingscheidt, “Learning to Dequantize Speech Signals by Primal-Dual Networks: An Approach for Acoustic Sensor Networks,” in Proc. of ICASSP, Brighton, UK, May 2019, pp. 7000–7004.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proc. of CVPR, Las Vegas, NV, USA, June 2016, pp. 770–778.
[30] M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An Audio-Visual Corpus for Speech Perception and Automatic Speech Recognition,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, June 2006.
[31] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The Third ‘CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines,” in Proc. of ASRU, Scottsdale, AZ, USA, Feb. 2015, pp. 504–511.
[32] ITU, Rec. P.862.2: Corrigendum 1, Wideband Extension to Recommendation P.862 for the Assessment of Wideband Telephone Networks and Speech Codecs, International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), Oct. 2017.
[33] ——, Rec. G.160: Voice Enhancement Devices; Appendix II: Objective Measures for the Characterization of the Basic Functioning of Noise Reduction Algorithms, International Telecommunication Union, Telecommunication Standardization Sector (ITU-T), June 2012.
[34] S. Elshamy, N. Madhu, W. Tirry, and T. Fingscheidt, “Instantaneous A Priori SNR Estimation by Cepstral Excitation Manipulation,” IEEE/ACM T-ASLP, vol. 25, no. 8, pp. 1592–1605, Aug. 2017.
[35] S. Gustafsson, R. Martin, and P. Vary, “On the Optimization of Speech Enhancement Systems Using Instrumental Measures,” in Proc. of Workshop on Quality Assessment in Speech, Audio, and Image Communication, Darmstadt, Germany, Mar. 1996, pp. 36–40.

A Perceptual Weighting Filter Loss for DNN Training in Speech Enhancement