
A Statistically Principled and Computationally Efficient Approach to Speech Enhancement using Variational Autoencoders

Abstract

Recent studies have explored the use of deep generative models of speech spectra based on variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation of the variational steps in which the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experiments show that the proposed method produces results on par with the aforementioned iterative methods using sampling, while decreasing the computational cost by a factor of 36 to reach a given performance.

SUBMITTED TO INTERSPEECH 2019

Index Terms: Speech enhancement, variational autoencoders, variational Bayes, non-negative matrix factorization.

1 Introduction

Speech enhancement is the problem of extracting a speech source from a noisy recording [1, 2]. It is commonly tackled by discriminative methods using deep neural networks (DNNs) to predict time-frequency masks [3, 4] or clean power spectrograms [5], or to model the signal variance [6] from spectrograms of mixtures. While the performance of these methods is satisfactory, their generalization capabilities are limited due to their supervised nature. Indeed, DNNs trained this way are specific to a task and a recording condition. In contrast, generative approaches do not suffer from this drawback: once a source model has been learned, the statistical model can account for different tasks and recording conditions.

Non-negative matrix factorization (NMF) [7] is a popular method for speech enhancement in which power or amplitude spectrograms are approximated as the product of two non-negative matrices. While certain forms of NMF can be interpreted as generative, they suffer from their linearity assumptions. Very recent studies have combined the modelling power of DNNs with statistical approaches in order to design unsupervised speech enhancement and source separation methods in both single-channel [8, 9, 10] and multichannel scenarios [11, 12, 13, 14, 15]. The main idea of these studies is to use variational autoencoders (VAEs) [16] to replace the NMF generative model, thus benefiting from DNNs’ representational power. Combined with an observation model, Expectation-Maximization (EM) [17] or Bayesian inference [18] algorithms can be derived to iteratively estimate individual source spectra from a given mixture.

In [8, 9, 10, 11, 12], speech enhancement is performed using a pretrained VAE-based generative model of speech spectra combined with an unsupervised NMF model for the noise. Inference of the clean speech relies on the Metropolis-Hastings algorithm [19] to estimate an intractable distribution over the VAE’s latent space. Similar approaches applied to source separation are developed in [13, 14], where the latent variables are updated using backpropagation. While the algorithms analytically derived in [8, 9, 10, 11, 12, 13, 14, 15] show promising performance, sampling and backpropagation methods are computationally expensive.

In order to train a VAE, an encoder is learned jointly with the generative model. The role of the encoder is to approximate the true, but intractable, posterior. Once the VAE is trained, the encoder could thus in principle be used to approximate the true posterior. Interestingly, [15] proposes a heuristic algorithm for multichannel source separation, based on [13], which uses this property of the encoder to achieve computational efficiency. However, this inference algorithm is not statistically principled.

In this study, we present a Bayesian single-channel speech enhancement algorithm in which a VAE is used as a generative model of speech spectra, and the noise is modelled using NMF. After unsupervised training of the speech model, inference consists of iteratively updating the source estimates using Wiener filtering and updating the corresponding latent representation using the encoder. Since no sampling or backpropagation is required, our method is significantly faster than those in [8, 9, 10, 11, 12, 13, 14]. Our use of the encoder differs from [15]: it is not heuristic but motivated by the same variational approximation as the one used to train VAEs. To the best of our knowledge, this is the first statistically principled approach re-using the probabilistic encoder as a posterior approximator.

We first introduce the VAE framework and its application to spectrogram modelling. We then detail our assumptions and our inference algorithm. Finally, we present experiments and results before concluding.

2 Model

Given an observation of clean speech embedded in noise, the goal of speech enhancement is to produce an estimate of the clean speech. Our approach is to use a VAE as the generative model of clean speech spectra to infer this estimate. We first describe the VAE framework and its training procedure in Sections 2.1 and 2.2, before introducing the observation model in Section 2.3.

2.1 Generative speech model

A common approach to modeling sound sources is to assume that the short-time Fourier transform (STFT) coefficients are proper complex Gaussian random variables. With $f$ denoting the frequency index and $t$ the time-frame index, for all $(f,t)\in\{0,\dots,F-1\}\times\{0,\dots,N-1\}$, each source STFT coefficient $s_{ft}\in\mathbb{C}$ independently follows

s_{ft} \sim \mathcal{N}_{c}(0,\,\sigma^{2}_{ft}). (1)

Several frameworks have been used to model the variance $\sigma^{2}_{ft}$ of this distribution [20, 2]. To this end, we choose the VAE framework, which uses the representational power of DNNs to build generative models. It consists in mapping a zero-mean unit-variance Gaussian random variable $\mathbf{z}_{t}\in\mathbb{R}^{L}$ through a DNN into the desired output distribution. In our setting, we use it to represent $\sigma^{2}_{ft}$. The generative model can be written as

\mathbf{z}_{t} \sim \mathcal{N}(\mathbf{0},\mathbf{I}), (2)
s_{ft}\,|\,\mathbf{z}_{t};\theta \sim \mathcal{N}_{c}\big(0,\sigma_{f}^{2}(\mathbf{z}_{t})\big), (3)

where $\sigma_{f}^{2}:\mathbb{R}^{L}\mapsto\mathbb{R}_{+}$ is a non-linear function implemented by a DNN with parameters $\theta$. Each $\sigma_{f}^{2}$ corresponds to one neuron of the $F$-dimensional output layer of a single DNN.
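For illustration, a minimal PyTorch sketch of such a decoder could look as follows. This is not the authors' implementation; the layer sizes follow Section 4.1 and the names are ours.

```python
# Illustrative sketch: a decoder mapping z_t in R^L to the per-frequency
# variances sigma_f^2(z_t) of the complex Gaussian model (3).
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, latent_dim=64, n_freq=513, hidden_dim=128):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.log_var = nn.Linear(hidden_dim, n_freq)   # one output neuron per frequency bin f

    def forward(self, z):
        h = torch.tanh(self.hidden(z))
        return torch.exp(self.log_var(h))              # sigma_f^2(z_t) > 0 for every f

decoder = Decoder()
z = torch.randn(1, 64)    # z_t ~ N(0, I), eq. (2)
var = decoder(z)          # variances of s_ft | z_t, eq. (3), shape (1, 513)
```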

2.2 VAE training and posterior approximation

The VAE is trained on clean speech data by maximizing the likelihood $p_{\theta}(\mathbf{s})=\prod_{t}p_{\theta}(\mathbf{s}_{t})$, with $\mathbf{s}_{t}=[s_{1t},\dots,s_{Ft}]^{T}$. To do so, we introduce a variational approximation $q_{\phi}(\mathbf{z}_{t}|\mathbf{s}_{t})$ of $p_{\theta}(\mathbf{z}_{t}|\mathbf{s}_{t})$, parametrized by $\phi$. We have

\log p_{\theta}(\mathbf{s}_{t}) = \mathcal{D}_{KL}\big(q_{\phi}(\mathbf{z}_{t}|\mathbf{s}_{t})\,||\,p_{\theta}(\mathbf{z}_{t}|\mathbf{s}_{t})\big) + \mathcal{L}(\theta,\phi;\mathbf{s}_{t}) (4)
\mathcal{L}(\theta,\phi;\mathbf{s}_{t}) = \mathbb{E}_{q_{\phi}}[\log p_{\theta}(\mathbf{s}_{t}|\mathbf{z}_{t})] - \mathcal{D}_{KL}\big(q_{\phi}(\mathbf{z}_{t}|\mathbf{s}_{t})\,||\,p(\mathbf{z}_{t})\big), (5)

where $\mathcal{D}_{KL}(q\,||\,p)=-\mathbb{E}_{q}[\log(p/q)]$ denotes the Kullback-Leibler (KL) divergence. Since the KL term in (4) is always non-negative, $\mathcal{L}(\theta,\phi;\mathbf{s}_{t})$ is a lower bound on the marginal log-likelihood $\log p_{\theta}(\mathbf{s}_{t})$. We can thus maximize the individual $\mathcal{L}(\theta,\phi;\mathbf{s}_{t})$ with respect to $\theta$ and $\phi$ in order to maximize the log-likelihood $\log p_{\theta}(\mathbf{s})$. For all $(l,t)\in\{0,\dots,L-1\}\times\{0,\dots,N-1\}$, $q_{\phi}(z_{lt}|\mathbf{s}_{t})$ is defined as in [8]

z_{lt}\,|\,\mathbf{s}_{t};\phi \sim \mathcal{N}\big(\tilde{\mu}_{l}(|\mathbf{s}_{t}|^{2}),\,\tilde{\sigma}_{l}^{2}(|\mathbf{s}_{t}|^{2})\big), (6)

where $\tilde{\mu}_{l}$ and $\tilde{\sigma}_{l}^{2}$ are DNN-based functions parametrized by $\phi$. Finally, we can use (2), (3) and (6) to write (5) as

\mathcal{L}(\theta,\phi;\mathbf{s}_{t}) \overset{c}{=} -\sum_{f}\mathbb{E}_{q_{\phi}(\mathbf{z}_{t}|\mathbf{s}_{t})}\big[d_{IS}\big(|s_{ft}|^{2},\sigma_{f}^{2}(\mathbf{z}_{t})\big)\big] + \frac{1}{2}\sum_{l}\Big(\log\tilde{\sigma}^{2}_{l}(|\mathbf{s}_{t}|^{2}) - \tilde{\mu}^{2}_{l}(|\mathbf{s}_{t}|^{2}) - \tilde{\sigma}^{2}_{l}(|\mathbf{s}_{t}|^{2})\Big), (7)

where $d_{IS}(x,y)=x/y-\log(x/y)-1$ denotes the Itakura-Saito divergence, $\mathbf{z}_{t}=[z_{1t},\dots,z_{Lt}]^{T}$ and $\overset{c}{=}$ denotes equality up to an additive constant. Using the so-called reparametrization trick [16], we can maximize $\mathcal{L}(\theta,\phi;\mathbf{s}_{t})$ using gradient descent optimization algorithms [21].
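As an illustration, the per-frame objective (7) could be computed as in the sketch below, assuming PyTorch. Here enc_mean and enc_logvar stand for the encoder outputs $\tilde{\mu}_{l}(|\mathbf{s}_{t}|^{2})$ and $\log\tilde{\sigma}_{l}^{2}(|\mathbf{s}_{t}|^{2})$, and decoder is any callable returning $\sigma_{f}^{2}(\mathbf{z}_{t})$; these names are hypothetical.

```python
# Illustrative sketch of the negative lower bound (7) for a batch of frames.
import torch

def neg_elbo(s_power, enc_mean, enc_logvar, decoder):
    """s_power: |s_t|^2, shape (batch, F); enc_mean, enc_logvar: shape (batch, L)."""
    # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(enc_mean)
    z = enc_mean + torch.exp(0.5 * enc_logvar) * eps
    var = decoder(z)                                           # sigma_f^2(z_t)
    ratio = s_power / var
    is_div = (ratio - torch.log(ratio) - 1.0).sum(dim=-1)      # Itakura-Saito term of (7)
    # KL(q_phi(z_t|s_t) || N(0, I)), i.e. minus the second term of (7) up to a constant
    kl = -0.5 * (1.0 + enc_logvar - enc_mean.pow(2) - enc_logvar.exp()).sum(dim=-1)
    return (is_div + kl).mean()
```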

The variational distribution $q_{\phi}(\mathbf{z}_{t}|\mathbf{s}_{t})$ is usually regarded as an unwanted by-product of the VAE training procedure, introduced only to learn $p_{\theta}(\mathbf{s}_{t}|\mathbf{z}_{t})$. However, a careful look at (4) shows that maximizing $\mathcal{L}(\theta,\phi;\mathbf{s}_{t})$ not only increases the log-likelihood, but also decreases the KL divergence between the true posterior and its variational approximation. In other words, in addition to the generative speech model, the VAE provides a variational approximation of the true posterior. This feature plays a key role in the inference algorithm proposed in Section 3.

2.3 Noisy speech model

As in [8], the STFT coefficients $n_{ft}$ of the noise are modelled with a rank-$K$ Gaussian NMF model. For all $(f,t)$,

n_{ft} \sim \mathcal{N}_{c}\big(0,(\mathbf{W}\mathbf{H})_{ft}\big), (8)

with $\mathbf{W}\in\mathbb{R}_{+}^{F\times K}$ and $\mathbf{H}\in\mathbb{R}_{+}^{K\times N}$.

Finally, independently for all $(f,t)$, the single-channel observation of the noisy speech is modeled by

x_{ft} = s_{ft} + n_{ft}, (9)

where $s_{ft}$ and $n_{ft}$ are independent.

3 Inference

Given a noisy speech signal $\mathbf{X}=\{x_{ft}\}_{f,t}$, our goal is now to maximize the likelihood of $\mathbf{X}$ under the mixture model (9), the generative model of speech (2)-(3) and the NMF model (8). With $\mathbf{n}_{t}=[n_{1t},\dots,n_{Ft}]^{T}$ and $\mathbf{H}_{t}=[H_{1t},\dots,H_{Kt}]^{T}$, we consider $\mathbf{y}_{t}=\{\mathbf{s}_{t},\mathbf{n}_{t},\mathbf{z}_{t}\}$ to be the set of latent variables and $\mathbf{\Theta}_{t}=\{\mathbf{W},\mathbf{H}_{t}\}$ the parameters of the model. We introduce $r(\mathbf{y}_{t})$, a variational approximation of $p(\mathbf{y}_{t}|\mathbf{x}_{t};\mathbf{\Theta}_{t})$. We have

\log p(\mathbf{x}_{t};\mathbf{\Theta}_{t}) = \mathcal{D}_{KL}\big(r(\mathbf{y}_{t})\,||\,p(\mathbf{y}_{t}|\mathbf{x}_{t};\mathbf{\Theta}_{t})\big) + \mathcal{L}(r,\mathbf{\Theta}_{t}),
\mathcal{L}(r,\mathbf{\Theta}_{t}) = \mathbb{E}_{r(\mathbf{y}_{t})}\left[\log\frac{p(\mathbf{x}_{t},\mathbf{y}_{t};\mathbf{\Theta}_{t})}{r(\mathbf{y}_{t})}\right]. (10)

We suppose that the variational distribution $r(\mathbf{y}_{t})$ factorizes as

r(\mathbf{s}_{t},\mathbf{n}_{t},\mathbf{z}_{t}) = r(\mathbf{s}_{t},\mathbf{n}_{t})\,r(\mathbf{z}_{t}) = \prod_{f}r(s_{ft},n_{ft})\prod_{l}r(z_{lt}). (11)

We can then iteratively maximize $\mathcal{L}(r,\mathbf{\Theta}_{t})$ with respect to the factorized distributions and the NMF parameters. The updates of the variational distributions are given by (10.9) in [18]:

\log r(s_{ft},n_{ft}) \overset{c}{=} \mathbb{E}_{r(\mathbf{z}_{t})}\big[\log p(x_{ft},s_{ft},n_{ft},\mathbf{z}_{t};\mathbf{\Theta}_{ft})\big] (12)
\log r(z_{lt}) \overset{c}{=} \mathbb{E}_{r(\mathbf{s}_{t},\mathbf{n}_{t})}\big[\log p(\mathbf{x}_{t},\mathbf{s}_{t},\mathbf{n}_{t},z_{lt};\mathbf{\Theta}_{t})\big]. (13)

$\mathbf{W}$ and $\mathbf{H}$ can be updated by maximizing

\mathcal{Q}(\mathbf{\Theta},\mathbf{\Theta}^{old}) \overset{c}{=} \mathbb{E}_{r(\mathbf{y};\mathbf{\Theta}^{old})}\big[\log p(\mathbf{x},\mathbf{y};\mathbf{\Theta})\big]. (14)

We detail these updates below and in the supporting document [22].

3.1 E-(s,n) step

To make the notation less cluttered, we define $\sigma^{2}_{n,ft}=(\mathbf{W}\mathbf{H})_{ft}$. Using (12), we find (see [22])

r(\mathbf{s}_{t},\mathbf{n}_{t}) = \prod_{f}\mathcal{N}_{c}(\boldsymbol{\mu}_{ft},\boldsymbol{\Sigma}_{ft}) = \mathcal{N}_{c}(\boldsymbol{\mu}_{t},\boldsymbol{\Sigma}_{t}), (15)

where $\boldsymbol{\mu}_{ft}$ and $\boldsymbol{\Sigma}_{ft}$ are defined as

\boldsymbol{\Sigma}_{ft} = \frac{\gamma^{2}_{ft}\,\sigma^{2}_{n,ft}}{\gamma^{2}_{ft}+\sigma^{2}_{n,ft}}\begin{bmatrix}1&-1\\-1&1\end{bmatrix}, \quad \boldsymbol{\mu}_{ft} = \frac{x_{ft}}{\gamma^{2}_{ft}+\sigma^{2}_{n,ft}}\begin{bmatrix}\gamma^{2}_{ft}\\ \sigma^{2}_{n,ft}\end{bmatrix}, (16)

with

\frac{1}{\gamma^{2}_{ft}} = \mathbb{E}_{r(\mathbf{z}_{t})}\Big[1/\sigma_{f}^{2}(\mathbf{z}_{t})\Big] \approx \frac{1}{D}\sum_{d=1}^{D} 1/\sigma_{f}^{2}\big(\mathbf{z}_{t}^{(d)}\big), (17)

where $\{\mathbf{z}_{t}^{(d)}\}_{d=1,\dots,D}$ are randomly drawn from $r(\mathbf{z}_{t})$.

We write $[\boldsymbol{\mu}_{s,t},\boldsymbol{\mu}_{n,t}]=\boldsymbol{\mu}_{t}$ and define $\boldsymbol{\Sigma}_{ss,t}$ and $\boldsymbol{\Sigma}_{nn,t}$ as the diagonal terms of $\boldsymbol{\Sigma}_{t}$.
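A possible NumPy sketch of this step for a single time frame is given below; decoder_var stands for a callable returning $\sigma_{f}^{2}(\mathbf{z})$ for one latent sample, and all names are illustrative rather than taken from the original implementation.

```python
# Illustrative sketch of the E-(s,n) step, eqs. (16)-(17), for one frame.
import numpy as np

def e_sn_step(x_t, z_mean, z_std, decoder_var, noise_var, n_samples=1):
    """x_t: complex mixture frame (F,); z_mean, z_std: parameters of r(z_t), shape (L,)."""
    # Monte Carlo estimate of 1/gamma_ft^2 = E_{r(z_t)}[1 / sigma_f^2(z_t)], eq. (17)
    inv_gamma = np.zeros_like(noise_var)
    for _ in range(n_samples):
        z = z_mean + z_std * np.random.randn(*z_mean.shape)   # z ~ r(z_t)
        inv_gamma += 1.0 / decoder_var(z)
    gamma2 = n_samples / inv_gamma
    # Wiener-like posterior mean and variance of (s_ft, n_ft), eq. (16)
    denom = gamma2 + noise_var
    mu_s = gamma2 / denom * x_t
    mu_n = noise_var / denom * x_t
    var_sn = gamma2 * noise_var / denom                        # diagonal entries of Sigma_ft
    return mu_s, var_sn, mu_n, var_sn
```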

3.2 E-z step

Using (13), we can compute $r(\mathbf{z}_{t})$ (see [22]); we get

\log r(\mathbf{z}_{t}) \overset{c}{=} \log p(\mathbf{z}_{t}) + \log p\Big(\mathbf{s}_{t}=\big(|\boldsymbol{\mu}_{s,t}|^{2}+\boldsymbol{\Sigma}_{ss,t}\big)^{\frac{1}{2}}\,\Big|\,\mathbf{z}_{t}\Big)
\overset{c}{=} \log p\Big(\mathbf{z}_{t}\,\Big|\,\mathbf{s}_{t}=\big(|\boldsymbol{\mu}_{s,t}|^{2}+\boldsymbol{\Sigma}_{ss,t}\big)^{\frac{1}{2}}\Big), (18)

where (18) was obtained using Bayes’ theorem. The VAE training assumes that $p_{\theta}(\mathbf{z}_{t}|\mathbf{s}_{t})$ can be approximated by $q_{\phi}(\mathbf{z}_{t}|\mathbf{s}_{t})$. We further assume that this still holds for $\mathbf{s}_{t}$ of the form $\big(|\boldsymbol{\mu}_{s,t}|^{2}+\boldsymbol{\Sigma}_{ss,t}\big)^{\frac{1}{2}}$. That is to say

r(z_{lt}) \approx q_{\phi}\Big(z_{lt}\,\Big|\,\mathbf{s}_{t}=\big(|\boldsymbol{\mu}_{s,t}|^{2}+\boldsymbol{\Sigma}_{ss,t}\big)^{\frac{1}{2}}\Big) (19)
= \mathcal{N}\Big(\tilde{\mu}_{l}\big(|\boldsymbol{\mu}_{s,t}|^{2}+\boldsymbol{\Sigma}_{ss,t}\big),\,\tilde{\sigma}_{l}^{2}\big(|\boldsymbol{\mu}_{s,t}|^{2}+\boldsymbol{\Sigma}_{ss,t}\big)\Big). (20)
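The following sketch illustrates this step with NumPy and illustrative names; encoder stands for the pre-trained probabilistic encoder returning the mean and log-variance of $q_{\phi}(\mathbf{z}_{t}|\cdot)$.

```python
# Illustrative sketch of the E-z step, eqs. (19)-(20), for one frame.
import numpy as np

def e_z_step(encoder, mu_s, var_s):
    """mu_s: posterior mean of s_t (complex, shape (F,)); var_s: diagonal of Sigma_ss,t."""
    pseudo_power = np.abs(mu_s) ** 2 + var_s     # second-order moment of r(s_t)
    z_mean, z_logvar = encoder(pseudo_power)     # q_phi evaluated at s_t = (|mu_s|^2 + Sigma_ss)^(1/2)
    return z_mean, np.exp(0.5 * z_logvar)        # mean and standard deviation of r(z_t)
```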

3.3 M-step

We now maximize $\mathcal{L}(r,\mathbf{\Theta}_{t})$ with respect to $\mathbf{\Theta}_{t}$ using (14). With $\mathbf{V}\in\mathbb{R}_{+}^{F\times N}$ defined column-wise by $(\mathbf{V})_{t}=|\boldsymbol{\mu}_{n,t}|^{2}+\boldsymbol{\Sigma}_{nn,t}$, and $\odot$ denoting element-wise matrix multiplication and exponentiation, we obtain multiplicative updates as in [23]:

\mathbf{H} \leftarrow \mathbf{H}\odot\frac{\mathbf{W}^{T}\big((\mathbf{W}\mathbf{H})^{\odot-2}\odot\mathbf{V}\big)}{\mathbf{W}^{T}(\mathbf{W}\mathbf{H})^{\odot-1}}, (21)
\mathbf{W} \leftarrow \mathbf{W}\odot\frac{\big((\mathbf{W}\mathbf{H})^{\odot-2}\odot\mathbf{V}\big)\mathbf{H}^{T}}{(\mathbf{W}\mathbf{H})^{\odot-1}\mathbf{H}^{T}}. (22)
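A compact NumPy sketch of these updates, assuming strictly positive floating-point arrays W, H and V as defined above (names are ours), is:

```python
# Illustrative sketch of the multiplicative NMF updates (21)-(22).
import numpy as np

def m_step(W, H, V, eps=1e-8):
    """W: (F, K), H: (K, N), V: (F, N); all entries strictly positive floats."""
    WH = W @ H
    H *= (W.T @ (WH ** -2 * V)) / (W.T @ WH ** -1 + eps)   # eq. (21)
    WH = W @ H
    W *= ((WH ** -2 * V) @ H.T) / (WH ** -1 @ H.T + eps)   # eq. (22)
    return W, H
```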

3.4 Speech Reconstruction

Let $\mathbf{\Theta}^{\star}=\{\mathbf{W}^{\star},\mathbf{H}^{\star}\}$ and $r^{\star}(\mathbf{s},\mathbf{n},\mathbf{z})=r^{\star}(\mathbf{s},\mathbf{n})\,r^{\star}(\mathbf{z})$ be, respectively, the set of NMF parameters and the variational distribution estimated by the proposed algorithm. As in [8], the final estimate of the speech source is given by

\hat{s}_{ft} = \mathbb{E}_{p(s_{ft}|x_{ft};\mathbf{\Theta}^{\star})}[s_{ft}] (23)
= \mathbb{E}_{p(\mathbf{z}_{t}|x_{ft};\mathbf{\Theta}^{\star})}\left[\frac{\sigma^{2}_{f}(\mathbf{z}_{t})}{\sigma^{2}_{f}(\mathbf{z}_{t})+(\mathbf{W}\mathbf{H})_{ft}}\right]x_{ft}. (24)

There are three ways to evaluate this estimate. The most straightforward is to replace $p(s_{ft}|x_{ft};\mathbf{\Theta}^{\star})$ in (23) by its variational approximation $r^{\star}(s_{ft})$. Alternatively, the expectation over $p(\mathbf{z}_{t}|x_{ft};\mathbf{\Theta}^{\star})$ in (24) can be approximated using the Metropolis-Hastings (MH) algorithm as in [8], using the mean of $r^{\star}(\mathbf{z}_{t})$ as the initial sample. Finally, we can compute the expectation in (24) by replacing $p(\mathbf{z}_{t}|x_{ft};\mathbf{\Theta}^{\star})$ with its variational approximation $r^{\star}(\mathbf{z}_{t})$. For ease of reference, we respectively call these methods S-Wiener, MH-Wiener and Z-Wiener.
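As an illustration of the Z-Wiener variant, the expectation in (24) can be approximated with samples from $r^{\star}(\mathbf{z}_{t})$. The sketch below assumes NumPy and a hypothetical decoder_var callable returning $\sigma_{f}^{2}(\mathbf{z})$; it is not the original implementation.

```python
# Illustrative sketch of the Z-Wiener reconstruction for one frame.
import numpy as np

def z_wiener(x_t, z_mean, z_std, decoder_var, noise_var, n_samples=25):
    """x_t: complex mixture frame (F,); z_mean, z_std: parameters of r*(z_t)."""
    gains = []
    for _ in range(n_samples):
        z = z_mean + z_std * np.random.randn(*z_mean.shape)   # z ~ r*(z_t)
        var = decoder_var(z)                                   # sigma_f^2(z)
        gains.append(var / (var + noise_var))                  # Wiener gain inside (24)
    return np.mean(gains, axis=0) * x_t                        # estimate of s_hat_t
```

S-Wiener, by contrast, simply takes $\boldsymbol{\mu}_{s,t}$ from the last E-(s,n) step as the speech estimate.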
Metropolis-Hastings: At the $m$-th iteration of the Metropolis-Hastings algorithm, we draw a random sample for each $t$ according to

\mathbf{z}_{t}\,|\,\mathbf{z}^{(m-1)}_{t};\epsilon_{mh}^{2} \sim \mathcal{N}\big(\mathbf{z}^{(m-1)}_{t},\epsilon_{mh}^{2}\mathbf{I}\big). (25)

The acceptance probability can be computed as

\alpha = \min\left(1,\frac{p(\mathbf{x}_{t}|\mathbf{z}_{t};\mathbf{\Theta}^{\star})\,p(\mathbf{z}_{t})}{p(\mathbf{x}_{t}|\mathbf{z}^{(m-1)}_{t};\mathbf{\Theta}^{\star})\,p(\mathbf{z}^{(m-1)}_{t})}\right). (26)

We accept the new sample and set $\mathbf{z}^{(m)}_{t}=\mathbf{z}_{t}$ only if $u$, drawn from a uniform distribution $\mathcal{U}([0,1])$, is smaller than $\alpha$. We keep only the last $R$ samples to compute $\hat{s}_{ft}$ from (24) and discard the samples drawn during the burn-in period.
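For completeness, a NumPy sketch of this MH-based reconstruction is given below; decoder_var is a hypothetical callable returning $\sigma_{f}^{2}(\mathbf{z})$, and the settings (100 iterations, last 25 samples kept, $\epsilon_{mh}^{2}=0.01$) follow Section 4.1. This is illustrative only.

```python
# Illustrative sketch of the MH-Wiener reconstruction, eqs. (24)-(26), for one frame.
import numpy as np

def mh_wiener(x_t, z_init, decoder_var, noise_var, n_iter=100, burn_in=75, eps=0.1):
    """x_t: complex mixture frame (F,); noise_var: (W H)_ft for this frame (F,)."""
    def log_lik(z):
        var = decoder_var(z) + noise_var                        # variance of x_ft given z_t
        return -np.sum(np.log(var) + np.abs(x_t) ** 2 / var)    # log N_c(x_t; 0, var), up to constants
    def log_prior(z):
        return -0.5 * np.sum(z ** 2)                            # log N(z_t; 0, I), up to constants
    z = z_init.copy()
    gains = []
    for m in range(n_iter):
        z_prop = z + eps * np.random.randn(*z.shape)            # random-walk proposal, eq. (25)
        log_alpha = log_lik(z_prop) + log_prior(z_prop) - log_lik(z) - log_prior(z)
        if np.log(np.random.rand()) < log_alpha:                # accept with probability alpha, eq. (26)
            z = z_prop
        if m >= burn_in:                                        # keep only the last samples
            var = decoder_var(z)
            gains.append(var / (var + noise_var))               # Wiener gain inside (24)
    return np.mean(gains, axis=0) * x_t                         # estimate of s_hat_t
```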

4 Experimental evaluation

4.1 Experimental settings

Dataset: As in [8, 9], we use the TIMIT corpus [24] to train the clean speech model. At inference time, we mix speech signals from the TIMIT test set with noise signals from the DEMAND corpus [25] at 0 dB signal-to-noise ratio, following the file list provided in the original implementation of [8] (https://gitlab.inria.fr/sileglai/mlsp2018). Note that both speakers and utterances are different from those in the training set.

VAE architecture and training: The architecture of the VAE is the same as in [8]. The input $|\mathbf{s}_{t}|^{2}$ is passed through a 128-dimensional linear layer with hyperbolic tangent activation, followed by two $L$-dimensional linear layers estimating $\tilde{\mu}_{l}(|\mathbf{s}_{t}|^{2})$ and $\log\tilde{\sigma}_{l}^{2}(|\mathbf{s}_{t}|^{2})$, from which $\mathbf{z}_{t}$ is sampled using the reparametrization trick. $\mathbf{z}_{t}$ is then fed to a 128-dimensional linear layer with hyperbolic tangent activation, followed by a linear output layer predicting $\log\sigma_{f}^{2}(\mathbf{z}_{t})$. For training, we compute the STFT of the utterances with a 1024-point sine window and a hop size of 256 samples. 20% of the training set is held out for validation. A VAE with $L=64$ is trained with the Adam optimizer [26], using a learning rate of $10^{-3}$ and a batch size of 128. Training stops if no improvement is seen on the validation loss for 10 epochs.
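The described architecture and optimizer settings could be written, for instance, as the following PyTorch sketch (ours, not the original code; $F=513$ corresponds to the 1024-point STFT):

```python
# Illustrative sketch of the VAE architecture of Section 4.1.
import torch
import torch.nn as nn

class SpeechVAE(nn.Module):
    def __init__(self, n_freq=513, latent_dim=64, hidden_dim=128):
        super().__init__()
        self.enc_hidden = nn.Linear(n_freq, hidden_dim)
        self.enc_mean = nn.Linear(hidden_dim, latent_dim)
        self.enc_logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec_hidden = nn.Linear(latent_dim, hidden_dim)
        self.dec_logvar = nn.Linear(hidden_dim, n_freq)

    def encode(self, s_power):
        h = torch.tanh(self.enc_hidden(s_power))
        return self.enc_mean(h), self.enc_logvar(h)        # mu_tilde and log sigma_tilde^2

    def decode(self, z):
        h = torch.tanh(self.dec_hidden(z))
        return self.dec_logvar(h)                          # log sigma_f^2(z_t)

    def forward(self, s_power):
        mean, logvar = self.encode(s_power)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)  # reparametrization trick
        return self.decode(z), mean, logvar

model = SpeechVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)        # settings from Section 4.1
```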

Baseline methods: We compare our method to two baselines. The first baseline, developed in [8], shares the statistical assumptions made in Section 2, except for the inclusion of a gain factor in (9). Inference is performed using a Monte Carlo Expectation-Maximization (MCEM) algorithm in which MH sampling is used to approximate the true posterior. We will refer to this method as MCEM. The second baseline is a heuristic we introduce to evaluate the impact of the covariance term $\boldsymbol{\Sigma}_{ss,t}$ in (20) on the convergence of the algorithm: we simply remove $\boldsymbol{\Sigma}_{ss,t}$ from (20) and leave the rest of the algorithm unchanged. Note that this is similar to the heuristic developed in [15] for determined multichannel source separation, adapted here to single-channel speech enhancement.

Figure 1: SDR as a function of the number of samples M for an input SNR of 0 dB, using MH for reconstruction. The Heuristic, MCEM and VEM methods are defined in Section 4.1. Error bars indicate 95% confidence intervals.

Inference settings: For a fair comparison, we use the same settings for the three methods: the rank of the NMF model is set to $K=10$, the latent dimension of the VAE is set to $L=64$, and the same stopping criterion as in [8] is used. $\mathbf{z}_{t}$ is initialized based on $q_{\phi}(\mathbf{z}_{t}|\mathbf{x}_{t})$ and, for the final speech estimate, equation (24) is approximated using the last 25 samples of a 100-iteration MH sampling with $\epsilon_{mh}^{2}=0.01$. In [8], for $R$ samples used, the total number of draws is $4R$. The methods are compared for the same number of samples actually used to estimate the expected values: if $D$ samples are used to evaluate (17), $4D$ samples are drawn in the MCEM case.

4.2 Results

Experiment 1: We first compare the performance of the three methods on the test set described above, in terms of Signal-to-Distortion Ratio (SDR) [27], computed using the mir_eval toolbox (https://github.com/craffel/mir_eval). The comparison is done for different numbers of samples $D\in\{1,2,5,10\}$. As can be seen in Fig. 1, the heuristic algorithm consistently performs worse than both other methods, showing the importance of the covariance term $\boldsymbol{\Sigma}_{ss,t}$ in (20). We also see that the performance of the proposed algorithm does not depend on the number of samples drawn to approximate (17), contrary to the MCEM algorithm, which performs poorly for $R=1$ and outperforms the proposed method for $R=10$. The superiority of the MCEM algorithm for a large number of samples is not surprising, given the approximation involved in the VEM approach.

Experiment 2: We then compare the SDR achieved by the MCEM and VEM methods as a function of the computational cost, both in number of iterations and in absolute time. The number of samples $D$ was set to 1 for the VEM, and $R$ to 5 for the MCEM, so as to achieve a similar SDR at convergence (see Fig. 1). At each iteration of both methods, we estimate the speech spectra by approximating (24) with MH sampling and we compute the SDR. This is done for each test utterance. Each individual SDR curve is then padded with the SDR obtained at the last iteration, and all SDR curves are averaged to produce the left graph of Fig. 2. We observe that the proposed method converges faster, which suggests that using the encoder allows for bigger jumps in the latent space than the sampling method with a small number of samples. On a 4-core i7-8650U CPU, one iteration takes on average 55 ms for the VEM algorithm and 753 ms for the MCEM algorithm. Using these numbers, we can convert the SDR as a function of the number of iterations into the SDR as a function of time. This is shown in the right part of Fig. 2, where the superiority of the proposed algorithm in terms of absolute speed is clear. To compare execution times, for each utterance and for both algorithms, we compute the number of iterations needed to reach the final SDR within a 0.5 dB tolerance. Multiplying the ratio of these iteration counts by the ratio of times per iteration yields an average computational cost decrease by a factor of 36 to reach the same performance.

Figure 2: Average convergence speed. SDR as a function of (left) the number of iterations and (right) absolute time.

Experiment 3: In this experiment, we compare the three ways of computing the source STFT coefficients in (23). Table 1 reports the mean SDR after convergence for each method. MH sampling consistently outperforms the other methods, suggesting that a better posterior approximation benefits the quality of the reconstruction.

            MH-Wiener   S-Wiener   Z-Wiener
VEM         11.8        11.4       11.4
Heuristic   6.4         5.8        5.8
MCEM        11.9        /          /

Table 1: SDR (dB) as a function of the reconstruction method. The MH-Wiener, S-Wiener and Z-Wiener methods are defined in Section 3.4.

5 Conclusion

In this paper, we proposed a speech enhancement method based on a variational EM algorithm. Our main contribution is the analytical derivation of the variational steps in which the encoder of the pre-learned VAE can be used to estimate the variational approximation of the true posterior distribution, using the very same assumption made to train VAEs. Experimental results showed that the principled approach outperforms its heuristic counterpart, and produces results on par with the algorithm proposed by Leglaive et al. [8], in which the true posterior is approximated using MH sampling, while decreasing the computational cost by a factor of 36. Future work includes multichannel and multisource extensions of this work. The use of other statistical models of speech spectra [28, 29, 30] will also be explored.

References

  • [1] P. C. Loizou, Speech Enhancement: Theory and Practice, 2nd ed. Boca Raton, FL, USA: CRC Press, Inc., 2013.
  • [2] E. Vincent, T. Virtanen, and S. Gannot, Audio Source Separation and Speech Enhancement. Wiley, Aug. 2018.
  • [3] F. Weninger, J. R. Hershey, J. L. Roux, and B. W. Schuller, “Discriminatively trained recurrent neural networks for single-channel speech separation,” 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 577–581, 2014.
  • [4] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2016.
  • [5] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in INTERSPEECH, 2013.
  • [6] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 24, no. 9, pp. 1652–1664, 2016.
  • [7] A. Ozerov and C. Fevotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 550–563, 2010.
  • [8] S. Leglaive, L. Girin, and R. Horaud, “A variance modeling framework based on variational autoencoders for speech enhancement,” in Proc. IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP), 2018.
  • [9] S. Leglaive, U. Simsekli, A. Liutkus, L. Girin, and R. Horaud, “Speech enhancement with variational autoencoders and alpha-stable distributions,” in IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP). Brighton, United Kingdom: IEEE, 2019, pp. 1–5.
  • [10] Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, and T. Kawahara, “Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, apr 2018, pp. 716–720.
  • [11] S. Leglaive, L. Girin, and R. Horaud, “Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization,” no. 3, 2018.
  • [12] K. Sekiguchi, Y. Bando, K. Yoshii, and T. Kawahara, “Bayesian Multichannel Speech Enhancement with a Deep Speech Prior,” Apsipa, no. November, pp. 1233–1239, 2018.
  • [13] H. Kameoka, L. Li, S. Inoue, and S. Makino, “Semi-blind source separation with multichannel variational autoencoder,” aug 2018.
  • [14] S. Seki, H. Kameoka, L. Li, T. Toda, and K. Takeda, “Generalized Multichannel Variational Autoencoder for Underdetermined Source Separation,” sep 2018.
  • [15] L. Li, H. Kameoka, and S. Makino, “Fast MVAE: Joint separation and classification of mixed sources based on multichannel variational autoencoder with auxiliary classifier,” CoRR, vol. abs/1812.0, 2018.
  • [16] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [17] G. Wei and M. Tanner, “A monte carlo implementation of the em algorithm and the poor man’s data augmentation algorithms,” Journal of the American Statistical Association, vol. 85, no. 411, pp. 699–704, 1 1990.
  • [18] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
  • [19] C. P. Robert and G. Casella, Monte Carlo Statistical Methods. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.
  • [20] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, “Probabilistic modeling paradigms for audio source separation,” in Machine Audition: Principles, Algorithms and Systems, W. Wang, Ed. IGI Global, 2010, pp. 162–185.
  • [21] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
  • [22] M. Pariente, A. Deleforge, and E. Vincent, “A statistically principled and computationally efficient approach to speech enhancement using variational autoencoders : Supporting document,” Inria, Tech. Rep. RR-9268, 2019. [Online]. Available: https://hal.inria.fr/hal-02089062
  • [23] C. Fevotte, N. Bertin, and J. Durrieu, “Nonnegative Matrix Factorization with the Itakura-Saito Divergence: With Application to Music Analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
  • [24] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “Timit acoustic phonetic continuous speech corpus,” Linguistic data consortium, 1993.
  • [25] J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” Proc. Int. Cong. on Acoust., 2013.
  • [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [27] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.
  • [28] A. A. Nugraha, K. Sekiguchi, and K. Yoshii, “A deep generative model of speech complex spectrograms,” arXiv preprint arXiv:1903.03269, 2019.
  • [29] P. Magron and T. Virtanen, “Bayesian anisotropic gaussian model for audio source separation,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 166–170.
  • [30] A. Liutkus, C. Rohlfing, and A. Deleforge, “Audio source separation with magnitude priors: The beads model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 56–60.