
ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

Zehua Chen  1  , Yihan Wu4, Yichong Leng5, Jiawei Chen6, Haohe Liu7, Xu Tan3
Yang Cui2, Ke Wang2, Lei He2, Sheng Zhao2, Jiang Bian3, Danilo Mandic1

{zehua.chen18,d.mandic}@imperial.ac.uk
{xuta,helei}@microsoft.com
1Imperial College London, 2Microsoft Azure Speech, 3MSRA, 4Renmin University of China, 5University of Science and Technology of China, 6South China University of Technology, 7University of Surrey. Work done during an internship at Microsoft. Corresponding author: Lei He.
Abstract

Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference speed, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and on two more challenging datasets with multiple speakers (LibriTTS) and a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed measured by real-time factor; 2) at similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.

1 Introduction

In recent years, text-to-speech (TTS) synthesis (Tan et al., 2021) has witnessed great progress with the development of deep generative models, e.g., auto-regressive models (Oord et al., 2016; Mehri et al., 2017; Kalchbrenner et al., 2018), flow-based models (Rezende & Mohamed, 2015; van den Oord et al., 2018; Kingma & Dhariwal, 2018), variational autoencoders (Peng et al., 2020), generative adversarial networks (Kumar et al., 2019; Kong et al., 2020; Binkowski et al., 2020), and denoising diffusion probabilistic models (DDPMs, diffusion models for short) (Ho et al., 2020; Song et al., 2021b). Based on the mechanism of iterative refinement, DDPMs have been able to achieve a sample quality that matches or even surpasses the state-of-the-art methods in acoustic models (Popov et al., 2021; Huang et al., 2022b), vocoders (Chen et al., 2021; Kong et al., 2021b; Lee et al., 2022a; Chen et al., 2022; Lam et al., 2022), and end-to-end TTS systems (Huang et al., 2022a).

A major disadvantage of diffusion models is their slow inference speed, which stems from the large number of sampling steps (e.g., 100∼1000 steps) required in high-dimensional data space. To address this issue, many works have explored minimizing the number of inference steps (e.g., 2∼6 steps) for TTS (Huang et al., 2022b; Liu et al., 2022c) or vocoders (Chen et al., 2021; Kong et al., 2021b; Lam et al., 2022; Chen et al., 2022). However, the high-dimensional data space cannot be accurately estimated with only a few inference steps, so the sample quality is degraded.

Instead of following previous methods that minimize the number of inference steps in DDPMs for speedup, we propose to reduce the complexity of the data space that the DDPM has to model. In this way, the model size as well as the number of inference steps can be reduced naturally, resulting in a small real-time factor during inference. To this end, we propose ResGrad, a diffusion model that predicts the residual between the output of an existing TTS model and the corresponding ground-truth mel-spectrogram. Compared with synthesizing speech from scratch, predicting the residual is less complex and easier for a DDPM to model. Specifically, as shown in Figure 1, we first utilize an existing TTS model such as FastSpeech 2 (Ren et al., 2021) to generate speech. In training, ResGrad is trained to generate the residual, while in inference, the output of ResGrad is added to the original TTS output, resulting in higher-quality speech.

Figure 1: Illustration of ResGrad. ResGrad first predicts the residual between the mel-spectrogram estimated by an existing TTS model and the ground-truth mel-spectrogram, and then adds the residual to the estimated mel-spectrogram to get the refined mel-spectrogram.

ResGrad can generate high-quality speech with a lightweight model and a small real-time factor due to the less complex learning space, i.e., the residual space. Meanwhile, ResGrad is a plug-and-play model, which does not require retraining the existing TTS model and can be easily applied to improve the quality of any existing TTS system.

The main contributions of our work are summarized as follows:

  • We propose a novel method, ResGrad, to speed up inference for DDPM-based TTS synthesis. By generating the residual instead of the whole speech from scratch, the complexity of the data space is reduced, which enables ResGrad to be lightweight and effective for generating high-quality speech with a small real-time factor.

  • Experiments on the LJSpeech, LibriTTS, and VCTK datasets show that, compared with other speed-up baselines for DDPMs, ResGrad achieves higher sample quality at the same RTF and is more than 10 times faster when generating speech of similar quality, which verifies its effectiveness.

2 Related Work

Denoising diffusion probabilistic models (DDPMs) have achieved the state-of-the-art generation results in various tasks (Ho et al., 2020; Song et al., 2021b; Kingma et al., 2021; Dhariwal & Nichol, 2021; Ramesh et al., 2022) but require a large number of inference steps to achieve high sample quality. Various methods have been proposed to accelerate the sampling process (Song et al., 2021a; Vahdat et al., 2021; Dhariwal & Nichol, 2021; Zheng et al., 2022; Xiao et al., 2022; Salimans & Ho, 2022).

In the field of speech synthesis, previous works on improving the inference speed of DDPMs can be mainly divided into three categories: 1) The prior/noise distribution used in DDPMs can be improved by leveraging some condition information (Lee et al., 2022a; Popov et al., 2021; Koizumi et al., 2022). The number of sampling steps in the inference stage can then be reduced by changing the prior/noise distribution from a standard Gaussian to a reparameterized distribution. 2) The inference process can start from a latent representation acquired by adding noise to an additional input, instead of starting from standard Gaussian noise, which reduces the number of inference steps in the sampling process (Liu et al., 2022b). 3) Additional training techniques can be used for DDPMs. For example, DiffGAN-TTS (Liu et al., 2022c) utilizes GAN-based models in the framework of DDPMs to replace the multi-step sampling process with a generator.

The above methods improve the inference speed of DDPMs from different perspectives. However, to guarantee high sample quality of the generated speech, a sufficient number of sampling steps is still required by these methods; otherwise, the speech quality is largely degraded.

In this work, instead of generating the whole speech from scratch, we generate the residual between the output of an existing TTS model and the corresponding ground-truth speech. The residual space is much easier for DDPMs to model than the original data space, so a lightweight DDPM can be trained and run with a small number of sampling steps, resulting in a small real-time factor. Note that since we improve the sample quality upon an existing TTS model (by predicting the residual), a small number of inference steps is enough to enhance the quality of the mel-spectrogram estimated by that model, and we can match the sample quality achieved by a larger DDPM that predicts from scratch with many more sampling steps.

Beyond speech synthesis, there are some relevant works on speeding up the inference of DDPMs. Whang et al. (2022) jointly trained a regression-based model for initial image deblurring and a diffusion model for residual estimation, which effectively improved the image quality measured by objective distances. In contrast, we train ResGrad separately to generate the residual, which does not require joint training with other models or carefully developing new training configurations.

3 Method

In this section, we first give an overview of the proposed ResGrad, and then introduce the formulation of the denoising diffusion probabilistic model in ResGrad. Finally, we discuss the advantages of our method.

3.1 Overview of ResGrad

As shown in Figure 1, ResGrad is trained to predict the residual between the mel-spectrogram estimated by an existing TTS model and the ground-truth mel-spectrogram, and then adds the residual to the estimated mel-spectrogram to get the refined mel-spectrogram. The generated residual contains high-frequency details and alleviates over-smoothness in the estimated mel-spectrogram, and thus can improve the sample quality. The three mel-spectrograms in Figure 1 show the estimated mel-spectrogram from an existing TTS model (i.e., FastSpeech 2), the residual generated by ResGrad, and the refined mel-spectrogram that is the sum of the first two.

3.2 Formulation of ResGrad

We introduce the formulation of ResGrad, which leverages a denoising diffusion probabilistic model to learn the residual between the estimated and ground-truth mel-spectrograms. The residual $x$ is calculated as

x = mel_{GT} - f_{\psi}(y),    (1)

where $mel_{GT}$ denotes the ground-truth mel-spectrogram, $f_{\psi}$ represents an existing TTS model (e.g., FastSpeech 2 (Ren et al., 2021)), and $y$ denotes the text input.
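Concretely, the residual target is an element-wise difference of two aligned mel-spectrograms. The following minimal sketch assumes both are PyTorch tensors of shape [n_mels, frames] that have already been time-aligned (which is why ground-truth durations are used when building training targets; see Section 4.1):

```python
import torch

def compute_residual(mel_gt: torch.Tensor, mel_pred: torch.Tensor) -> torch.Tensor:
    """Residual target for ResGrad: x = mel_GT - f_psi(y).

    Both inputs are assumed to be [n_mels, frames] mel-spectrograms that are
    frame-aligned, i.e., the TTS model was run with ground-truth durations
    when producing mel_pred (Section 4.1).
    """
    assert mel_gt.shape == mel_pred.shape, "mel-spectrograms must be aligned"
    return mel_gt - mel_pred
```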

In the training stage, we inject standard Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ into the clean residual data $x_0$ according to a predefined noise schedule $\beta$ with $0 < \beta_1 < \dots < \beta_T < 1$. At each time step $t \in [1, \dots, T]$, the corrupted samples are distributed as:

q(x_t | x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I\right),    (2)

where $\alpha_t := 1 - \beta_t$, and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$ denotes the corresponding noise level at time step $t$.
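Since $q(x_t | x_0)$ is Gaussian, a corrupted sample can be drawn in closed form as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. A minimal sketch of this forward process is given below; the linear schedule is an assumption for illustration, as the paper only requires $0 < \beta_1 < \dots < \beta_T < 1$ with $T = 1000$ (Section 4.3):

```python
import torch

T = 1000                                   # diffusion steps used in training (Section 4.3)
betas = torch.linspace(1e-4, 0.02, T)      # assumed linear schedule for illustration
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I) in closed form."""
    abar = alpha_bars[t].view(-1, 1, 1)    # broadcast over [batch, n_mels, frames]
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps
```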

There are different ways to parameterize DDPMs and derive the training objective (Kingma et al., 2021). We follow the previous work Grad-TTS (Popov et al., 2021) and directly estimate the score function $s_\theta$, i.e., the gradient of the log-density of the noised data, $\log q(x_t | x_0)$, with respect to the data point $x_t$: $s(x_t, t) = \nabla_{x_t} \log q(x_t | x_0) = -\frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}}$. Our training objective is then formulated as:

L(\theta) = \mathbb{E}_{x_0, \epsilon, t} \left\| s_\theta(x_t, t, c) + \frac{\epsilon}{\sqrt{1 - \bar{\alpha}_t}} \right\|_2^2.    (3)
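One training-loss evaluation implementing Eq. (3) could look like the sketch below, reusing q_sample and alpha_bars from the previous snippet; the score_model call signature (noised residual, time step, condition) is an assumption about the network interface:

```python
def resgrad_loss(score_model, x0, cond):
    """One loss evaluation of Eq. (3): the network output s_theta(x_t, t, c)
    is regressed towards -eps / sqrt(1 - abar_t)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # uniform time steps
    eps = torch.randn_like(x0)                        # standard Gaussian noise
    xt = q_sample(x0, t, eps)                         # forward-diffused residual
    abar = alpha_bars[t].view(-1, 1, 1)
    target = -eps / (1.0 - abar).sqrt()
    score = score_model(xt, t, cond)                  # condition c = f_psi(y)
    return ((score - target) ** 2).mean()             # squared error, averaged
```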

In the inference stage, we first estimate the mel-spectrogram $f_\psi(y)$ from the text input $y$ with an existing TTS model $f_\psi$. Conditioning on $f_\psi(y)$, we start from standard Gaussian noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoise the data sample to synthesize the residual $x_0$ via:

p_\theta(x_0, \cdots, x_{T-1} | x_T, c) = \prod_{t=1}^{T} p_\theta(x_{t-1} | x_t, c),    (4)

where $c$ is the mel-spectrogram $f_\psi(y)$ generated by the existing TTS model. Then, we add the generated residual $\hat{x}_0$ and $f_\psi(y)$ together as the final output

mel_{ref} = \hat{x}_0 + f_\psi(y),    (5)

where $mel_{ref}$ is the refined mel-spectrogram.
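The plug-and-play inference procedure can be sketched as follows. For the reverse step we assume a standard DDPM ancestral update rewritten in terms of the score (the paper follows the Grad-TTS sampler, which may differ in detail), and naively subsampling the T = 1000 schedule to a few steps is only a rough approximation of a properly re-derived few-step sampler:

```python
@torch.no_grad()
def resgrad_infer(score_model, tts_model, text, n_steps=4):
    """Refine the output of an existing TTS model with ResGrad (Eq. 5)."""
    cond = tts_model(text)                    # f_psi(y): estimated mel-spectrogram
    x = torch.randn_like(cond)                # x_T ~ N(0, I)
    for t in torch.linspace(T - 1, 0, n_steps).long():
        score = score_model(x, t, cond)
        # DDPM ancestral mean with the score parameterization:
        # eps_hat = -sqrt(1 - abar_t) * score  =>  mean = (x + beta_t * score) / sqrt(alpha_t)
        x = (x + betas[t] * score) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return cond + x                           # mel_ref = f_psi(y) + x_hat_0
```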

3.3 Advantages of ResGrad

Generally speaking, both iterative TTS models (e.g., Grad-TTS) and non-iterative TTS models (e.g., FastSpeech 2) have their pros and cons: 1) iterative TTS models can achieve high sample quality, but at the cost of slow inference speed; 2) non-iterative TTS models are fast in inference, but the sample quality may not be good enough. Instead of choosing either the iterative or the non-iterative method, ResGrad makes full use of both: it uses a non-iterative model to predict a rough sample, and an iterative model to predict the residual. We analyze the advantages of ResGrad from the following two aspects:

  • Standing on the shoulders of giants. ResGrad fully leverages the prediction of an existing non-iterative TTS model and only predicts the residual in an iterative way. Compared to previous methods for speeding up diffusion models that predict the data entirely from scratch, ResGrad stands on the shoulders of giants and avoids reinventing the wheel, which leaves large room for speeding up diffusion models.

  • Plug-and-play. ResGrad does not require re-training the existing TTS model but acts as a post-processing module, which can be applied to existing TTS models in a plug-and-play way.

4 Experimental Setup

4.1 Dataset and Preprocessing

Dataset

We conducted our experiments on three open-source benchmark datasets. LJ-Speech (Ito & Johnson, 2017) (https://keithito.com/LJ-Speech-Dataset/) is a single-speaker dataset with a sampling rate of 22.05 kHz. It contains 13,100 English speech recordings from a female speaker, totaling around 24 hours. Following Kong et al. (2021b); Chen et al. (2022), we use the 523 samples in LJ-001 and LJ-002 for model evaluation and the remaining 12,577 samples for model training. LibriTTS (Zen et al., 2019) (https://research.google/tools/datasets/LibriTTS/) is a multi-speaker dataset with a sampling rate of 24 kHz. In total, we use 1046 speakers from the clean subsets (train-clean-100, train-clean-360, dev-clean, test-clean). For each speaker, we randomly select 20 recordings as the test set, while the remaining samples are used as the training set. VCTK (Yamagishi, 2012) (https://datashare.ed.ac.uk/handle/10283/3443) is a multi-speaker dataset with a sampling rate of 48 kHz. We use 108 speakers from the dataset, and the first 5 recordings of each speaker (540 recordings in total) are used as the test set. The remaining samples are used for training.

Preprocessing

We use the open-source grapheme-to-phoneme tool (Park & Kim, 2019) to convert English grapheme sequences to phoneme sequences, and then transform the raw waveforms of the above three datasets into mel-spectrograms following common practice. Specifically, for LJSpeech, we extract 80-band mel-spectrograms with 1024 FFT points, 80 Hz and 7600 Hz lower and higher frequency cutoffs, and a hop length of 256 (Ren et al., 2021; Huang et al., 2022b; Popov et al., 2021). For LibriTTS, we extract 80-band mel-spectrograms with 2048 FFT points, 80 Hz and 7600 Hz lower and higher frequency cutoffs, and a hop length of 300 (Lee et al., 2022b; Kim et al., 2020). For VCTK, we use 128-band mel-spectrograms with 2048 FFT points, 0 Hz and 24000 Hz lower and higher frequency cutoffs, and a hop length of 480 (Jang et al., 2021; Liu et al., 2022a).
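For reference, these three configurations map onto a standard mel-spectrogram extractor as follows. The sketch uses librosa as one possible toolkit (the paper does not name the exact implementation), and the log compression at the end is an assumption:

```python
import librosa

# One possible realization of the extraction settings above; the toolkit
# choice and the log compression are assumptions, not specified in the paper.
MEL_CONFIG = {
    "LJSpeech": dict(sr=22050, n_fft=1024, hop_length=256, n_mels=80,  fmin=80, fmax=7600),
    "LibriTTS": dict(sr=24000, n_fft=2048, hop_length=300, n_mels=80,  fmin=80, fmax=7600),
    "VCTK":     dict(sr=48000, n_fft=2048, hop_length=480, n_mels=128, fmin=0,  fmax=24000),
}

def extract_mel(wav_path: str, dataset: str = "LJSpeech"):
    cfg = MEL_CONFIG[dataset]
    wav, _ = librosa.load(wav_path, sr=cfg["sr"])
    mel = librosa.feature.melspectrogram(
        y=wav, sr=cfg["sr"], n_fft=cfg["n_fft"], hop_length=cfg["hop_length"],
        n_mels=cfg["n_mels"], fmin=cfg["fmin"], fmax=cfg["fmax"])
    return librosa.power_to_db(mel)        # [n_mels, frames], log-compressed
```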

Figure 2: The comparison between the residual calculated with ground-truth pitch, and the residual calculated without ground-truth pitch (i.e., pitch predicted by model).

Calculation of Residual

We calculate the residual between the mel-spectrogram predicted by FastSpeech 2 (Ren et al., 2021) and the ground-truth mel-spectrogram. Since FastSpeech 2 uses ground-truth pitch/duration in training but predicted pitch/duration in inference, we need to decide whether to use the ground-truth or the predicted pitch/duration when calculating the residual. For duration, we must use the ground-truth duration to predict the mel-spectrogram for residual calculation; otherwise, there would be a length mismatch between the predicted and ground-truth mel-spectrograms. For pitch, we compare the residuals calculated using ground-truth and predicted pitch, as shown in Figure 2(a) and Figure 2(b). In Figure 2(a), the residual mainly captures the high-frequency details in the ground-truth mel-spectrogram that are difficult for TTS models to predict. In this way, the residual generated by ResGrad mainly adds high-frequency details that improve the sample quality, e.g., clearness and naturalness, instead of changing the pitch of the estimated mel-spectrogram. However, if we calculate the residual with predicted pitch as shown in Figure 2(b), ResGrad has to simultaneously correct the pitch of the estimated samples towards the ground-truth samples, which increases its burden. Thus, we use ground-truth pitch to calculate the residual. Note that in inference, the estimated mel-spectrogram is predicted by FastSpeech 2 with both predicted duration and pitch, instead of ground-truth pitch and duration.
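Putting these choices together, building the training targets could look like the following sketch. The keyword interface of the FastSpeech 2 model (passing ground-truth duration and pitch explicitly) is hypothetical and only illustrates the intent:

```python
def build_residual_target(fastspeech2, batch):
    """Build one ResGrad training pair (hypothetical FastSpeech 2 interface).

    Ground-truth durations keep the predicted and ground-truth mel-spectrograms
    frame-aligned; ground-truth pitch keeps the residual focused on spectral
    detail rather than pitch correction.
    """
    mel_pred = fastspeech2(
        batch["phonemes"],
        duration=batch["duration_gt"],   # GT duration: avoids length mismatch
        pitch=batch["pitch_gt"],         # GT pitch: residual excludes pitch errors
    )
    residual = batch["mel_gt"] - mel_pred
    return residual, mel_pred            # residual x_0 and diffusion condition c
```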

4.2 Model Configuration

We choose FastSpeech 2 (Ren et al., 2021) as the non-iterative TTS model, which is composed of 6 feed-forward Transformer blocks with multi-head self-attention and 1D convolutions in both the phoneme encoder and the mel-spectrogram decoder. In each feed-forward Transformer block, the hidden size of the multi-head attention is set to 256 and the number of heads is set to 2. The kernel sizes of the 1D convolutions in the two-layer convolution network are set to 9 and 1, and the input/output channel sizes of the first and second layers are 256/1024 and 1024/256. The architectures of the duration predictor and variance adaptor are exactly the same as those in FastSpeech 2 (Ren et al., 2021), composed of stacks of several convolution layers followed by a final linear projection layer. The numbers of convolution layers in the duration predictor and variance adaptor are 2 and 5, the kernel size is set to 3, the input/output size of all layers is 256/256, and the dropout rate is set to 0.5. ResGrad uses a U-Net architecture that mainly follows the official open-source implementation of Grad-TTS (Popov et al., 2021), except with fewer model parameters.

For LJ-Speech and LibriTTS, we use the publicly available pretrained HiFi-GAN (Kong et al., 2020) (https://github.com/kan-bayashi/ParallelWaveGAN). For VCTK, we trained HiFi-GAN to generate 48 kHz waveforms with the open-source code (https://github.com/jik876/hifi-gan). In the experiments, the same vocoder is used for each of the baseline TTS models.

4.3 Training and Evaluation

For the training of the non-iterative TTS model, we use the AdamW optimizer with $\beta_1 = 0.8$, $\beta_2 = 0.99$, weight decay $\lambda = 0.01$, and follow the same learning rate schedule as in Vaswani et al. (2017). We use 8 NVIDIA V100 GPUs with a batch size of 32 per GPU. It takes 100k steps until moderate-quality speech can be generated.

For the training of ResGrad, the Adam optimizer is used with a learning rate of 0.0001, and the number of diffusion time steps is set to T = 1000. We use 8 NVIDIA V100 GPUs with a batch size of 16 per GPU. We restrict the maximum length of each sample to 4 seconds to enlarge the batch size, and we find that a strong sample quality improvement is achieved after 500k training steps.
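The 4-second cap can be implemented with a simple random crop; the sketch below shows the idea for a single mel-spectrogram, with the frame count derived from each dataset's hop length (the exact cropping strategy is an assumption):

```python
import torch

def crop_mel(mel: torch.Tensor, hop_length: int, sample_rate: int,
             max_seconds: float = 4.0) -> torch.Tensor:
    """Randomly crop a [n_mels, frames] mel-spectrogram to at most `max_seconds`.
    In practice the residual and its condition would be cropped with the same
    start frame so that they stay aligned."""
    max_frames = int(max_seconds * sample_rate / hop_length)
    if mel.shape[-1] <= max_frames:
        return mel
    start = int(torch.randint(0, mel.shape[-1] - max_frames + 1, (1,)))
    return mel[..., start:start + max_frames]
```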

For evaluation, we mainly use human subjective evaluation: MOS (Mean Opinion Score) and CMOS (Comparison Mean Opinion Score) tests. We invite 14 native English speakers as judges to give a MOS for each utterance, evaluating the overall sample quality on a 5-point scale. Each CMOS result is given by 14 judges per utterance, comparing the samples generated by two different models. For measuring the inference speed, we use the objective real-time factor (RTF) calculated on an NVIDIA V100 GPU.
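RTF is simply the wall-clock synthesis time divided by the duration of the generated audio; a minimal way to measure it (with a hypothetical synthesize callable) is:

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.
    `synthesize` is any callable mapping text to a 1-D waveform array."""
    start = time.perf_counter()
    wav = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(wav) / sample_rate)
```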

4.4 Training of Baseline Models

For a fair and reproducible comparison, we train several baseline models on the same datasets, i.e., the single-speaker dataset LJSpeech and the multi-speaker datasets VCTK and LibriTTS. The training settings of these baseline models are: 1) DiffSpeech is a diffusion-based acoustic model for TTS. We train DiffSpeech following Liu et al. (2022b), with configurations adjusted accordingly for the different datasets. The training of DiffSpeech has two stages: a warmup stage and a main stage. The warmup stage trains the auxiliary decoder, which here is FastSpeech 2 (Ren et al., 2021), for 160k steps. The main stage trains DiffSpeech for 160k steps until convergence. 2) DiffGAN-TTS. Following Liu et al. (2022c), we train DiffGAN-TTS with T = 1, 2, and 4. For both the generator and the discriminator, we use the Adam optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.9$. Models are trained on one NVIDIA V100 GPU. We set the batch size to 64 and train the models for 300k steps until the loss converges. 3) ProDiff (Huang et al., 2022b) is a progressive fast diffusion model for high-quality text-to-speech. It is distilled from an N-step teacher model, which further decreases the sampling time. Following Huang et al. (2022b), we first train the ProDiff teacher with 4 diffusion steps. Then, we take the converged teacher model to train ProDiff with 2 diffusion steps. Both the ProDiff teacher model and the ProDiff model are trained for 200,000 steps on one NVIDIA V100 GPU. 4) GradTTS. Following Popov et al. (2021), we train the model on the original mel-spectrograms. Grad-TTS is trained for 1,700k steps on 8 NVIDIA V100 GPUs with a batch size of 16. We use the Adam optimizer with a learning rate of 0.0001. 5) FastSpeech 2. Following Ren et al. (2021), we train FastSpeech 2 with a batch size of 48 sentences. FastSpeech 2 is trained for 160k steps until convergence on 8 NVIDIA V100 GPUs.

Table 1: The MOS and RTF results of different methods on the LJSpeech, LibriTTS, and VCTK datasets. Note that GradTTS-50, ResGrad-4, and ResGrad-50 use 50, 4, and 50 iteration steps, respectively. The MOS scores on the LibriTTS dataset are lower than those on LJSpeech and VCTK for two main reasons. First, the speech samples in that dataset may contain background noise, which decreases the quality of the synthesized samples. Second, LibriTTS contains 1046 speakers; since the other models generate TTS samples from scratch, the synthesis task is challenging for them, whereas generating only the residual for each speaker is easier and thus achieves better sample quality.
                     LJSpeech               LibriTTS               VCTK
Models               MOS (↑)      RTF (↓)   MOS (↑)      RTF (↓)   MOS (↑)      RTF (↓)
Recordings           4.91±0.02    -         4.4±0.05     -         4.75±0.02    -
GT mel + Vocoder     4.28±0.06    -         4.5±0.04     -         4.61±0.04    -
FastSpeech 2         3.29±0.10    0.003     2.58±0.13    0.003     3.14±0.11    0.004
DiffGAN-TTS          2.14±0.09    0.009     -            -         -            -
DiffSpeech           3.78±0.11    0.182     2.50±0.13    0.263     3.44±0.10    0.857
ProDiff              3.14±0.09    0.033     2.75±0.11    0.061     3.29±0.09    0.235
GradTTS-50           4.14±0.09    0.193     2.38±0.18    0.259     3.85±0.09    0.292
ResGrad-4            4.07±0.11    0.018     3.00±0.16    0.020     3.78±0.11    0.022
ResGrad-50           4.14±0.09    0.169     3.41±0.18    0.223     3.93±0.09    0.244

5 Results

5.1 Speech Quality and Inference Speed

We compare ResGrad with several baselines, including 1) Recordings: the ground-truth recordings; 2) GT mel + Vocoder: using the ground-truth mel-spectrogram to synthesize the waveform with the HiFi-GAN vocoder (Kong et al., 2020); 3) FastSpeech 2 (Ren et al., 2021): one of the most popular non-autoregressive TTS models, which can synthesize high-quality speech at a fast speed; 4) DiffGAN-TTS (Liu et al., 2022c) (https://github.com/keonlee9420/DiffGAN-TTS): a DDPM-based text-to-speech model which speeds up inference with GANs; 5) DiffSpeech (Liu et al., 2022b) (https://github.com/MoonInTheRiver/DiffSinger): a DDPM-based text-to-speech model which speeds up inference with a shallow diffusion mechanism; 6) ProDiff (Huang et al., 2022b) (https://github.com/Rongjiehuang/ProDiff): a progressive fast diffusion model for text-to-speech; 7) GradTTS (Popov et al., 2021) (https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS): a DDPM-based TTS model which can synthesize high-quality speech. FastSpeech 2 is the non-iterative TTS model used in this work. The other baselines provide different methods to accelerate the inference of DDPMs.

We keep the experimental settings consistent across the different methods to exclude other interfering factors, and reproduce the results of these methods on the three open-source datasets. The MOS and RTF results are shown in Table 1 (ResGrad-4 and ResGrad-50 use 4 and 50 iteration steps, respectively). We have several observations:

  • Voice quality. ResGrad-50 achieves the best MOS on all three datasets, especially on the more challenging multi-speaker and high-sampling-rate datasets, i.e., LibriTTS and VCTK. On LJSpeech, ResGrad-50 matches the MOS of GradTTS-50 and outperforms all other baselines, while having a faster inference speed. ResGrad-4 also outperforms all baselines except GradTTS-50 on LJSpeech and VCTK, and on LibriTTS it even performs better than GradTTS-50.

  • Inference speed. ResGrad-4 outperforms the other baselines when considering both inference speed and voice quality. Compared with GradTTS-50, both ResGrad-4 and ResGrad-50 achieve faster inference with on-par or even better speech quality. Compared with ProDiff and DiffSpeech, ResGrad-4 reduces the inference time while achieving better voice quality on all three datasets. The inference speed of DiffGAN-TTS is the fastest, but its voice quality is very poor.

In summary, the quality of ResGrad with 4 iteration steps is much higher than that of the non-iterative TTS model (e.g., FastSpeech 2) and other speed-up methods for DDPMs (e.g., DiffGAN-TTS, DiffSpeech, and ProDiff). With the same number of iteration steps, ResGrad matches or even exceeds the quality of the Grad-TTS model, but with a smaller model and lower RTF. It is worth mentioning that our advantages are more pronounced on the more challenging datasets with multiple speakers and high sampling rates.

The comments from judges can be found in Appendix A.

Figure 3: The left column shows a ground-truth residual sample calculated with both $dur_{GT}$ and $pitch_{GT}$ (top) and the corresponding predicted residual sample (bottom), while the right column shows the corresponding ground-truth residual sample calculated with $dur_{GT}$ and $pitch_{Pred}$ (top) and the corresponding predicted residual sample (bottom).

5.2 Ablation Study

Residual Calculation

As described in Section 4.1, there are two choices for calculating the residual: using $pitch_{GT}$ or $pitch_{Pred}$ in FastSpeech 2. We show both calculated residual samples and the generated ones in Figure 3. As can be seen, when we use $pitch_{Pred}$, the residual related to pitch information is not generated as it is in the ground-truth residual samples. In the inference stage, this means the model cannot correct the pitch towards the ground-truth data samples. Moreover, it may produce distortions in the synthesized samples. We show the quality comparison between these two settings in Table 2.

Table 2: The effect of GT pitch information.
Model                     CMOS (↑)
ResGrad                   0
ResGrad - $pitch_{GT}$    -0.17

Table 3: Comparison between ResGrad and ResUNet.
Model      CMOS (↑)    Size (M)
ResGrad    0.00        2.03
ResUNet    -0.18       31.8
Figure 4: The comparison between the mel-spectrogram generated by FastSpeech 2, the refined mel-spectrogram by ResGrad, and the ground-truth mel-spectrogram. FastSpeech 2 suffers from the over-smoothing problem, while ResGrad generates mel-spectrogram with more detailed frequency information, which is more similar to ground truth mel-spectrogram.

Residual Prediction

It is natural to ask whether other models can be used to predict the residual and effectively improve the sample quality. Following Liu et al. (2021), we use ResUNet (Kong et al., 2021a), an improved UNet (Ronneberger et al., 2015), to model the residual conditioned on the generated mel-spectrogram $f_\psi(y)$. The ResUNet consists of several encoder and decoder blocks, with skip connections between encoder and decoder blocks at the same level. Both the encoder and decoder are series of residual convolution blocks, and each convolutional layer in a residual block consists of a batch normalization, a LeakyReLU activation, and a linear convolution. ResUNet has achieved promising results in predicting the high-frequency information of mel-spectrograms (Liu et al., 2022a) in the speech super-resolution task (Wang & Wang, 2021). Compared with ResGrad, a considerably larger ResUNet model (31.8 M parameters) is used for generating the residual. We show the comparison of sample quality and model size between ResGrad and ResUNet in Table 3. ResGrad achieves higher sample quality with a much smaller model, demonstrating the effectiveness of our method.
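For illustration, a residual convolution block of the kind described above could look like the following sketch; the number of layers, channel width, and kernel size are assumptions rather than the exact ResUNet configuration:

```python
import torch.nn as nn

class ResConvBlock(nn.Module):
    """Sketch of one residual convolution block (BatchNorm -> LeakyReLU -> Conv
    with a skip connection), following the description above; sizes are assumed."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.LeakyReLU(0.01),
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm2d(channels), nn.LeakyReLU(0.01),
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
        )

    def forward(self, x):                  # x: [batch, channels, n_mels, frames]
        return x + self.body(x)
```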

Table 4: CMOS results on LibriTTS.
Model        CMOS (↑)    Size (M)    RTF (↓)
ResGrad      0.00        2.03        0.223
ResGrad-L    0.09        7.68        0.265

Table 5: CMOS results on VCTK.
Model        CMOS (↑)    RTF (↓)
ResGrad      0           0.244
ResGrad-L    0.02        0.294

Model Size

The performance of ResGrad in improving sample quality is not sensitive to the model size. We verify this by employing a larger diffusion model (i.e., the decoder model in Grad-TTS (Popov et al., 2021)) in ResGrad for comparison, denoted as ResGrad-L. The comparison results are shown in Table 4 and Table 5. With the same number of inference steps (50), ResGrad-L achieves a similar CMOS to ResGrad on both LibriTTS and VCTK, but its inference speed becomes slower.

5.3 Case Study

As shown in Figure 4, we conduct a case study and visualize the mel-spectrograms from the ground-truth recording and those generated by different TTS models, i.e., FastSpeech 2 and ResGrad. We observe that FastSpeech 2 tends to generate over-smoothed mel-spectrograms. Compared with FastSpeech 2, ResGrad generates mel-spectrograms with more detailed information and can therefore synthesize more expressive and higher-quality speech.

Figure 5: The decoding process of ResGrad with 4 sampling steps.

Figure 5 shows the inference process of ResGrad, which uses 4 sampling steps to generate the clean residual from standard Gaussian noise. As can be seen, the noise is effectively removed at each sampling step. Although more details of the residual are generated with more inference steps, a clean residual can already be obtained within a few sampling steps, which is sufficient to improve the TTS sample quality.

6 Conclusion

In this work, to accelerate the inference of diffusion-based TTS models, we propose ResGrad, which models the residual between the output of an existing TTS model and the corresponding ground truth. Different from previous works on speeding up diffusion models that learn the whole data space of speech, ResGrad learns the residual space, which is much easier to model. As a result, ResGrad only requires a lightweight model with a few iterative steps to synthesize high-quality speech at a low real-time factor. Experiments on three datasets show that ResGrad synthesizes higher-quality speech than baselines under the same real-time factor and is more than 10 times faster than baselines at similar speech quality. For future work, we will investigate applying the idea of ResGrad to other TTS architectures and other tasks such as image synthesis.

References

  • Binkowski et al. (2020) Mikolaj Binkowski, Jeff Donahue, Sander Dieleman, Aidan Clark, Erich Elsen, Norman Casagrande, Luis C. Cobo, and Karen Simonyan. High fidelity speech synthesis with adversarial networks. In ICLR, 2020.
  • Chen et al. (2021) Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In ICLR, 2021.
  • Chen et al. (2022) Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo P. Mandic, Lei He, and Sheng Zhao. Infergrad: Improving diffusion models for vocoder by considering inference in training. In ICASSP, 2022.
  • Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • Huang et al. (2022a) Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. Fastdiff: A fast conditional diffusion model for high-quality speech synthesis. In IJCAI, 2022a.
  • Huang et al. (2022b) Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In ACM Multimedia Conference, 2022b.
  • Ito & Johnson (2017) Keith Ito and Linda Johnson. The LJ speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
  • Jang et al. (2021) Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, and Juntae Kim. Univnet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. In Interspeech, 2021.
  • Kalchbrenner et al. (2018) Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In ICML, 2018.
  • Kim et al. (2020) Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33:8067–8077, 2020.
  • Kingma & Dhariwal (2018) Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.
  • Kingma et al. (2021) Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. In NeurIPS, 2021.
  • Koizumi et al. (2022) Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, and Michiel Bacchiani. Specgrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping. arXiv preprint arXiv:2203.16749, 2022.
  • Kong et al. (2020) Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In NeurIPS, 2020.
  • Kong et al. (2021a) Qiuqiang Kong, Yin Cao, Haohe Liu, Keunwoo Choi, and Yuxuan Wang. Decoupling magnitude and phase estimation with deep resunet for music source separation. In ISMIR, pp.  342–349, 2021a.
  • Kong et al. (2021b) Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR, 2021b.
  • Kumar et al. (2019) Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C. Courville. Melgan: Generative adversarial networks for conditional waveform synthesis. In NeurIPS, 2019.
  • Lam et al. (2022) Max W. Y. Lam, Jun Wang, Dan Su, and Dong Yu. Bilateral denoising diffusion models. In ICLR, 2022.
  • Lee et al. (2022a) Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, and Tie-Yan Liu. Priorgrad: Improving conditional denoising diffusion models with data-driven adaptive prior. In ICLR, 2022a.
  • Lee et al. (2022b) Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. Bigvgan: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658, 2022b.
  • Liu et al. (2021) Haohe Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, and Yuxuan Wang. Voicefixer: Toward general speech restoration with neural vocoder. CoRR, abs/2109.13731, 2021.
  • Liu et al. (2022a) Haohe Liu, Woosung Choi, Xubo Liu, Qiuqiang Kong, Qiao Tian, and DeLiang Wang. Neural vocoder is all you need for speech super-resolution. In Interspeech, 2022a.
  • Liu et al. (2022b) Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, and Zhou Zhao. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In AAAI, 2022b.
  • Liu et al. (2022c) Songxiang Liu, Dan Su, and Dong Yu. Diffgan-tts: High-fidelity and efficient text-to-speech with denoising diffusion gans. arXiv preprint arXiv:2201.11972, 2022c.
  • Mehri et al. (2017) Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron C. Courville, and Yoshua Bengio. Samplernn: An unconditional end-to-end neural audio generation model. In ICLR, 2017.
  • Oord et al. (2016) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • Park & Kim (2019) Kyubyong Park and Jongseok Kim. g2pe. https://github.com/Kyubyong/g2p, 2019.
  • Peng et al. (2020) Kainan Peng, Wei Ping, Zhao Song, and Kexin Zhao. Non-autoregressive neural text-to-speech. In ICML, 2020.
  • Popov et al. (2021) Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail A. Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In ICML, 2021.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Ren et al. (2021) Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech 2: Fast and high-quality end-to-end text to speech. In ICLR, 2021.
  • Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pp.  234–241. Springer, 2015.
  • Salimans & Ho (2022) Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.
  • Song et al. (2021a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021a.
  • Song et al. (2021b) Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021b.
  • Tan et al. (2021) Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561, 2021.
  • Vahdat et al. (2021) Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. In NeurIPS, 2021.
  • van den Oord et al. (2018) Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel wavenet: Fast high-fidelity speech synthesis. In ICML, 2018.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Wang & Wang (2021) Heming Wang and DeLiang Wang. Towards robust speech super-resolution. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2058–2066, 2021.
  • Whang et al. (2022) Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G. Dimakis, and Peyman Milanfar. Deblurring via stochastic refinement. In CVPR, 2022.
  • Xiao et al. (2022) Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. In ICLR, 2022.
  • Yamagishi (2012) Junichi. Yamagishi. English multi-speaker corpus for cstr voice cloning toolkit. https://datashare.ed.ac.uk/handle/10283/3443, 2012.
  • Zen et al. (2019) Heiga Zen, Rob Clark, Ron J. Weiss, Viet Dang, Ye Jia, Yonghui Wu, Yu Zhang, and Zhifeng Chen. Libritts: A corpus derived from librispeech for text-to-speech. In Interspeech, 2019.
  • Zheng et al. (2022) Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models. arXiv preprint arXiv:2202.09671, 2022.

Appendix A Human Evaluations and Comments

We list some comments about the perceptual quality of different models given by the native English speakers. None of them are familiar with the TTS synthesis task.

ResGrad-4

“The model achieves high quality acoustics, limited articulation in speech, with very low (but audible) hiss in some recordings.”; “pauses too long between the words”.

ResGrad-50

“Words are really clear and understandable”; “slight less sharp compared to ground truth”; “much better acoustics - hiss level is reduced”; “Sounds robotic with buzzing noise present in some samples, although slower speech”.

DiffGAN-TTS

“You can easily understand the words and distinguish them but the sound is quite distorted”; “Sounds distorted”.

DiffSpeech

“low but noticeable hiss levels on most recordings, some recordings are buzzy. Pitch seems to be higher/speech is faster - is the sampling rate correct?”; “the speed make the quality lower and the volume drop sometimes”.

FastSpeech 2

“similar comments to 1-1 (ResGrad-4)”; “similar to model 2 (DiffGAN-TTS), there is some good articulation, but very audible buzz that corrupts some of the speech.”

GradTTS-50

“It make longer pauses when needed to better understand the meaning of the speech.”; “articulation is very robotic. However, good acoustic quality (audible but quiet hiss in most recordings)”; “slower and more understandable. Realistic changes in pitch. More natural breaks between words which sound realistic.”.

ProDiff

“Slight background noise but the human voice was mostly good”; “robotic voice and a bit faster of model 1-1 (ResGrad-4) making it a bit less understandable”; “limited articulation high level of crackle in most recordings”.

GT-Mel

“articulation is good. High quality acoustics similar in 1-1 (ResGrad-4)”; “slower and more understandable. Lacks realistic changes in pitch. Slight high frequency artefact”; “very realistic audio with realistic tonal changes, some noise present in audio samples but I could listen to audio books with this voice”.