FreeCodec: A disentangled neural speech codec with fewer tokens
Abstract
Neural speech codecs have attracted great attention for their outstanding reconstruction from discrete token representations. They are a crucial component in generative tasks such as speech coding and large language models (LLMs). However, most approaches based on residual vector quantization perform worse with fewer tokens due to low coding efficiency when modeling complex, coupled information. In this paper, we propose a neural speech codec named FreeCodec, which employs a more effective encoding framework by decomposing the intrinsic properties of speech into separate components: 1) a global vector is extracted as the timbre information, 2) a prosody encoder with a long stride models the prosody information, and 3) the content information is extracted by a content encoder. Using different training strategies, FreeCodec achieves state-of-the-art performance in both reconstruction and disentanglement scenarios. Results from subjective and objective experiments demonstrate that our framework outperforms existing methods. The related code will be released at https://github.com/exercise-book-yq/FreeCodec.
Index Terms:
Speech coding, neural networks, vector quantization.
I Introduction
Neural speech codecs are widely used to compress speech signals into a limited number of bits with minimal distortion. Compared to traditional parametric algorithms [1, 2], they have progressed significantly in medium- and low-bitrate scenarios. With the development of large language models (LLMs), the discrete codes of neural speech codecs also play a pivotal role in LLM-driven generative speech models, such as LLM-based TTS systems.
Existing mainstream end-to-end (E2E) works [3, 4, 5, 6, 7, 8] rely on the VQ-VAE [9] architecture to learn an encoder, a vector quantizer, and a decoder in a data-driven manner. These techniques use vector quantization to compress or discretize the latent features from the encoder. Much research [10, 7, 11, 12] focuses on optimizing vector quantization to improve the quality of the reconstructed speech. For instance, SoundStream [3] introduces a residual vector quantizer (RVQ) into neural speech codecs to achieve state-of-the-art (SOTA) performance from 3 to 18 kbps; it is more efficient and has lower complexity than a plain vector quantizer. In [10], group-residual vector quantization (GRVQ) is proposed to obtain better performance while using four quantizers at 2 kbps. Furthermore, the Descript audio codec (DAC) [11] introduces factorized and L2-normalized codes to improve codebook usage, operating at a higher compression rate than EnCodec [4]. However, when using two codebooks or fewer, these methods struggle, e.g., losing content and producing unintelligible speech.


Several works [13, 14, 15, 16, 17, 18] have explored speech reconstruction with disentangled features under the VQ-VAE paradigm. Similar to voice conversion (VC), these methods disentangle a global speaker identity from content representations. [13] utilizes pretrained self-supervised learning models to disentangle content information on small datasets. Recently, TiCodec [15] explored an additional global encoder to extract time-invariant information from speech. It reduces the redundancy of frame-level information to attain higher encoding efficiency, and exhibits improved performance using one or two tokens. However, speech comprises several attributes (not just global and non-global ones), and each of them should be modeled by a dedicated module [19]. Inspired by this, we explore a framework with a more detailed disentanglement of representations for better reconstruction. This framework can also be used flexibly in disentanglement scenarios.
In this paper, we propose FreeCodec, a neural speech codec with more detailed representations. By modeling complex speech as intrinsic attributes (speaker, prosody, and content), it achieves better performance in reconstruction and disentanglement scenarios. Meanwhile, we adopt different frame-level representations for different attributes, enabling more effective quantization and higher compression. Our main contributions are as follows:
• We propose FreeCodec, a neural speech codec with more fine-grained disentanglement that encodes the intrinsic properties of speech with self-supervision.
• We show that our proposed framework can be used flexibly in reconstruction (e.g., zero-shot TTS, speech coding) and disentanglement (e.g., voice conversion) scenarios with different training strategies.
• Our proposed method, using approximately 57 tokens per second, surpasses existing state-of-the-art models in subjective and objective evaluations (speech samples are available at https://exercise-book-yq.github.io/FreeCodec-Demo/).
II Proposed Method
II-A Overview
As illustrated in Fig. 1(a), our proposed method consists of three components: encoders, quantizers, and decoders. Unlike existing works, our encoder models different intrinsic properties of human speech in more detail. This relieves the pressure on a single encoder component and provides a finer-grained ability to model intrinsic speech attributes. Specifically, we introduce three types of encoders to encode content, prosody, and timbre information, respectively. The input raw speech signal is first mapped to three latent feature representations. The quantization layers then produce compressed representations. Finally, the decoders reconstruct the speech signal from the compressed latent representations. In addition, we use different training strategies to provide three versions for reconstruction (zero-shot TTS, speech coding) and disentanglement (voice conversion) scenarios. The details of the training strategies are described in Sections II-C and II-D.
II-B Encoders
Speaker Encoder. Existing approaches extract a global embedding in the encoder in an unsupervised manner, and the decoder then aggregates the global embedding into the frame-level embeddings. They assume that the global embedding represents time-invariant information, such as speaker characteristics and speaking style. We follow this unsupervised manner and extract the speaker information more precisely. We utilize a pre-trained speaker encoder, ECAPA-TDNN [20], a state-of-the-art speaker recognition network based on convolutional neural networks with an attentive statistics pooling layer. A mel-spectrogram computed from the raw speech signal is fed into the speaker encoder to obtain one global timbre vector.
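For illustration, the following is a minimal sketch of extracting such a global timbre vector with a publicly available ECAPA-TDNN checkpoint from SpeechBrain; the checkpoint, its 192-dimensional output, and the fact that it computes filterbank features internally (rather than consuming our mel-spectrogram directly) are assumptions of this sketch rather than details of our implementation.

```python
# Hedged sketch: extracting a single global timbre vector with a pretrained
# ECAPA-TDNN speaker encoder (SpeechBrain checkpoint assumed).
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load a 16 kHz utterance (path is illustrative).
wav, sr = torchaudio.load("example_16khz.wav")
assert sr == 16000

# Pretrained ECAPA-TDNN; internally computes filterbank features before the
# TDNN blocks and the attentive statistics pooling layer.
speaker_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    run_opts={"device": "cpu"},
)

with torch.no_grad():
    # One global embedding per utterance: the "timbre vector" that the
    # decoder is conditioned on (192-dim for this checkpoint).
    timbre = speaker_encoder.encode_batch(wav)  # (batch, 1, 192)
print(timbre.shape)
```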
Content Encoder. The architecture of the content encoder follows the SuperCodec [6] encoder, using four convolution blocks with strides (2, 4, 5, 8). This yields a total downsampling factor of 320 and outputs 256-dimensional content features at a frame rate of 50 Hz from 16 kHz speech. To reduce the redundancy of the content encoder, we propose to use a self-supervised model to explicitly model the content information, as shown in Fig. 1(a).
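A minimal PyTorch sketch of such a strided convolutional content encoder is given below; the strides (2, 4, 5, 8) and the 256-dimensional, roughly 50 Hz output follow the description above, while the channel widths, kernel sizes, and block internals are illustrative assumptions.

```python
# Hedged sketch of a SuperCodec-style content encoder: four strided
# convolution blocks with strides (2, 4, 5, 8), i.e. 320x downsampling,
# mapping 16 kHz waveform to 256-dim features at roughly 50 Hz.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, in_ch, kernel_size=3, padding=1),
            nn.ELU(),
            # The strided convolution performs the downsampling of this block.
            nn.Conv1d(in_ch, out_ch, kernel_size=2 * stride, stride=stride,
                      padding=stride // 2),
            nn.ELU(),
        )

    def forward(self, x):
        return self.block(x)

class ContentEncoder(nn.Module):
    def __init__(self, strides=(2, 4, 5, 8), base_ch=32, out_dim=256):
        super().__init__()
        layers = [nn.Conv1d(1, base_ch, kernel_size=7, padding=3)]
        ch = base_ch
        for s in strides:
            layers.append(ConvBlock(ch, min(2 * ch, out_dim), s))
            ch = min(2 * ch, out_dim)
        layers.append(nn.Conv1d(ch, out_dim, kernel_size=3, padding=1))
        self.encoder = nn.Sequential(*layers)

    def forward(self, wav):          # wav: (batch, 1, samples)
        return self.encoder(wav)     # (batch, 256, ~samples / 320)

enc = ContentEncoder()
wav = torch.randn(1, 1, 16000)       # 1 s of 16 kHz audio
print(enc(wav).shape)                # -> roughly (1, 256, 50), i.e. ~50 Hz
```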
Prosody Encoder. The prosody encoder extracts information other than the speaker and content information, as shown in Fig. 1(b). In [21, 22], the first 20 bins of each mel-spectrogram frame are taken as input to extract prosody, because this band contains almost complete prosody but much less speaker and content information than the full band. Following [22], our prosody encoder consists of two convolution stacks and a max-pooling layer with a stride of 8 to further remove content and speaker information. We set the FFT size to 1024 and the hop size to 320. With this setup, the prosody encoder produces 256-dimensional feature embeddings at a frame rate of roughly 7 Hz.
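The following sketch illustrates this prosody path under the stated setup (first 20 mel bins, FFT 1024, hop 320, stride-8 pooling); the kernel sizes, activations, and channel widths are assumptions.

```python
# Hedged sketch of the prosody encoder: only the first 20 mel bins of an
# 80-bin mel-spectrogram (FFT 1024, hop 320 on 16 kHz audio, ~50 frames/s)
# are used; two convolution stacks plus a stride-8 max-pooling layer bring
# the rate down to roughly 7 Hz with 256-dim outputs.
import torch
import torch.nn as nn
import torchaudio

class ProsodyEncoder(nn.Module):
    def __init__(self, n_low_bins=20, hidden=256):
        super().__init__()
        def conv_stack(in_ch):
            return nn.Sequential(
                nn.Conv1d(in_ch, hidden, kernel_size=5, padding=2),
                nn.GELU(),
                nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
                nn.GELU(),
            )
        self.stack1 = conv_stack(n_low_bins)
        self.stack2 = conv_stack(hidden)
        # Stride-8 pooling discards fine-grained (content/speaker) detail and
        # reduces the ~50 Hz features to ~7 Hz.
        self.pool = nn.MaxPool1d(kernel_size=8, stride=8)

    def forward(self, mel):                 # mel: (batch, 80, frames)
        low = mel[:, :20, :]                # keep only the first 20 bins
        h = self.stack2(self.stack1(low))
        return self.pool(h)                 # (batch, 256, frames / 8)

mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=320, n_mels=80)
wav = torch.randn(1, 16000)                 # 1 s of audio
mel = mel_fn(wav)                           # (1, 80, ~51)
print(ProsodyEncoder()(mel).shape)          # -> (1, 256, ~6-7)
```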
II-C Quantization
We adopt different methods to quantize different features. For the content and prosody information, we adopt a plain vector quantizer with one codebook each, with the codebook size set to 256. For the speaker embedding, we use two types of representations: a continuous representation for FreeCodec-v1 and FreeCodec-v3, and a discrete representation for FreeCodec-v2. Specifically, in FreeCodec-v2 (for speech coding) we compress the speaker embedding with group vector quantization (GVQ), which divides the speaker embedding into eight groups, each quantized by one codebook of size 1024. For FreeCodec-v1 and FreeCodec-v3, we feed the continuous representation to the decoder for better reconstruction in scenarios such as zero-shot TTS and voice conversion, similar to [23, 24].
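A minimal sketch of such a group vector quantizer is shown below; the 192-dimensional speaker embedding, the straight-through gradient estimator, and the random codebook initialization are assumptions, and codebook learning details (e.g., EMA updates, commitment terms) are omitted.

```python
# Hedged sketch of group vector quantization (GVQ) for the speaker embedding:
# the embedding is split into 8 groups, each quantized against its own
# codebook of 1024 entries via nearest-neighbour lookup.
import torch
import torch.nn as nn

class GroupVQ(nn.Module):
    def __init__(self, dim=192, groups=8, codebook_size=1024):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.sub_dim = dim // groups
        # One codebook per group, each with `codebook_size` code vectors.
        self.codebooks = nn.Parameter(
            torch.randn(groups, codebook_size, self.sub_dim))

    def forward(self, x):                       # x: (batch, dim)
        b = x.shape[0]
        x = x.view(b, self.groups, self.sub_dim)
        quantized, indices = [], []
        for g in range(self.groups):
            d = torch.cdist(x[:, g], self.codebooks[g])   # (batch, 1024)
            idx = d.argmin(dim=-1)                        # nearest code index
            q = self.codebooks[g][idx]
            # Straight-through estimator so gradients reach the encoder.
            quantized.append(x[:, g] + (q - x[:, g]).detach())
            indices.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)

gvq = GroupVQ()
speaker_emb = torch.randn(4, 192)               # e.g. ECAPA-TDNN embeddings
q, codes = gvq(speaker_emb)
print(q.shape, codes.shape)                     # (4, 192), (4, 8)
```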
II-D Decoders and Training Strategy
FreeCodec employs a decoder that mirrors the encoder's upsampling structure (https://github.com/exercise-book-yq/Supercodec), using strides (8, 5, 4, 2) for a total upsampling factor of 320. However, reconstruction at such a high compression rate is particularly challenging. Before upsampling, we first apply a 4-layer Transformer encoder [25] to enhance semantic modeling. Then, we use ConvNeXt [26] as the backbone to condition the decoder on the prosody and speaker representations.
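The sketch below illustrates this decoder front end (a 4-layer Transformer encoder followed by mirrored transposed-convolution upsampling with strides (8, 5, 4, 2)); the attention-head count, feed-forward size, channel schedule, and the omission of the ConvNeXt-based conditioning are assumptions of the sketch.

```python
# Hedged sketch of the decoder front end: a 4-layer Transformer encoder over
# the quantized content features, then mirrored transposed-convolution
# upsampling with strides (8, 5, 4, 2), i.e. 320x total upsampling.
import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, dim=256, strides=(8, 5, 4, 2)):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=1024, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        ups, ch = [], dim
        for s in strides:
            ups += [nn.ConvTranspose1d(ch, max(ch // 2, 32), kernel_size=2 * s,
                                       stride=s, padding=s // 2), nn.ELU()]
            ch = max(ch // 2, 32)
        ups.append(nn.Conv1d(ch, 1, kernel_size=7, padding=3))
        self.upsample = nn.Sequential(*ups)

    def forward(self, z):                  # z: (batch, 256, frames) at ~50 Hz
        h = self.transformer(z.transpose(1, 2)).transpose(1, 2)
        return self.upsample(h)            # (batch, 1, ~frames * 320) waveform

dec = DecoderSketch()
z = torch.randn(1, 256, 50)                # 1 s of quantized content features
print(dec(z).shape)                        # -> roughly (1, 1, 16000)
```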
We incorporate adversarial training to promote perceptual quality, using a multi-scale STFT-based (MS-STFT) discriminator. The training loss of the proposed method comprises five components: reconstruction loss, VQ commitment loss, content loss, feature matching loss, and adversarial loss. The reconstruction, feature matching, and adversarial losses follow EnCodec [4].
For the content loss, we extract the last-layer representation of a pre-trained WavLM-Large model [27] as the semantic learning target at 50 Hz. In FreeCodec-v1 and FreeCodec-v2, we use it to reduce the redundancy of the content encoder by maximizing the cosine similarity, over the feature dimensions and across all timesteps, between the outputs of the content encoder and the semantic learning target. In FreeCodec-v3, we apply the semantic learning target only at the decoder, to prevent additional speaker information from leaking into the content encoder and quantizer. We also apply spectrogram-resize based data augmentation [28] to the prosody and content encoders during training. This approach achieves better performance in disentanglement scenarios.
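A minimal sketch of this content loss is shown below, using the Hugging Face WavLM-Large checkpoint as the frozen teacher; the learnable projection to the teacher dimension and the simple length alignment are assumptions not specified above.

```python
# Hedged sketch of the content (semantic distillation) loss: maximize the
# cosine similarity between content-encoder outputs and frozen WavLM-Large
# last-layer features, computed per feature dimension across all timesteps.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import WavLMModel

class ContentLoss(nn.Module):
    def __init__(self, student_dim=256, teacher="microsoft/wavlm-large"):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained(teacher).eval()
        for p in self.wavlm.parameters():
            p.requires_grad = False
        # Assumed learnable projection from the 256-dim content features to
        # WavLM-Large's hidden size (1024).
        self.proj = nn.Linear(student_dim, self.wavlm.config.hidden_size)

    def forward(self, content_feats, wav):
        # content_feats: (batch, frames, 256); wav: (batch, samples) at 16 kHz
        with torch.no_grad():
            target = self.wavlm(wav).last_hidden_state  # (batch, frames', 1024)
        T = min(content_feats.shape[1], target.shape[1])  # crude length align
        student = self.proj(content_feats[:, :T])
        # Cosine similarity per feature dimension across timesteps (dim=1 is
        # the time axis), averaged; minimizing 1 - similarity maximizes it.
        return 1.0 - F.cosine_similarity(student, target[:, :T], dim=1).mean()

loss_fn = ContentLoss()
loss = loss_fn(torch.randn(2, 50, 256), torch.randn(2, 16000))
print(loss.item())
```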
III Experimental Setup
III-A Training Details and Baselines
We train our models on LibriSpeech [29], which consists of approximately 1000 hours of 16 kHz speech; we use the train-clean-100, train-clean-360, and train-other-500 subsets. For a fair comparison, we adopt two recent neural codecs, TiCodec (https://github.com/y-ren16/TiCodec) and the Descript audio codec (DAC, https://github.com/descriptinc/descript-audio-codec), which have demonstrated success in the domain of neural speech codecs. These baselines are re-trained with 1 and 2 codebooks, corresponding to 0.5 kbps and 1 kbps. FreeCodec and TiCodec are trained on two V100 GPUs for 400k iterations with a batch size of 20 per GPU; DAC is trained on two V100 GPUs for 800k iterations with a batch size of 10 per GPU. In addition, we consider several open-source speech codecs as baselines: EnCodec (https://github.com/facebookresearch/encodec) at 3 kbps, Lyra-v2 (https://github.com/google/lyra) at 3.2 kbps, SpeechTokenizer [16] at 3 kbps, SemantiCodec [18] at 1.3 kbps, and WavTokenizer-small [8] at 0.9 kbps. For EnCodec and WavTokenizer-small, we use the 24 kHz pre-trained models to synthesize speech, which corresponds to compression rates of 2 kbps and 0.6 kbps, respectively, at a 16 kHz sampling rate.
For voice conversion, three baseline models are selected for comparison with FreeCodec-v3: VQMIVC [31], YourTTS [32], and Wav2vec-vc [33]. These models are trained on the VCTK dataset.
TABLE I: Objective reconstruction results on VCTK and the LibriSpeech test-clean subset.

| Model | Sampling rate | Bandwidth | Tokens/s | VCTK UTMOS | VCTK STOI | VCTK WARP-Q | VCTK SECS | Test-clean UTMOS | Test-clean STOI | Test-clean WARP-Q | Test-clean SECS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Target | - | - | - | 4.085 | - | - | - | 4.086 | - | - | - |
| EnCodec | 24 kHz | 3 kbps | 300 | 2.145 | 0.714 | 2.578 | 0.797 | 1.516 | 0.821 | 2.614 | 0.855 |
| WavTokenizer | 24 kHz | 0.9 kbps | 75 | 3.296 | 0.832 | 2.192 | 0.811 | 3.792 | 0.897 | 2.135 | 0.904 |
| SemantiCodec | 16 kHz | 1.3 kbps | 100 | 3.334 | 0.853 | 2.078 | 0.868 | 2.922 | 0.879 | 2.049 | 0.936 |
| SpeechTokenizer | 16 kHz | 3 kbps | 300 | 3.953 | 0.868 | 2.048 | 0.915 | 3.848 | 0.908 | 2.034 | 0.954 |
| Lyra V2 | 16 kHz | 3.2 kbps | - | 2.887 | 0.881 | 2.127 | 0.890 | 2.968 | 0.901 | 2.075 | 0.945 |
| TiCodec | 16 kHz | 0.5 kbps | 50 | 3.421 | 0.825 | 2.578 | 0.797 | 3.307 | 0.821 | 2.614 | 0.855 |
| TiCodec | 16 kHz | 1 kbps | 100 | 3.584 | 0.879 | 2.333 | 0.856 | 3.616 | 0.881 | 2.354 | 0.908 |
| DAC | 16 kHz | 0.5 kbps | 50 | 3.476 | 0.852 | 2.550 | 0.804 | 3.543 | 0.859 | 2.504 | 0.883 |
| DAC | 16 kHz | 1 kbps | 100 | 3.780 | 0.904 | 2.251 | 0.883 | 3.790 | 0.901 | 2.274 | 0.920 |
| FreeCodec-v1 | 16 kHz | 0.45 kbps | 57 | 4.034 | 0.918 | 1.966 | 0.919 | 4.085 | 0.892 | 2.195 | 0.944 |
| w/o content loss | 16 kHz | 0.45 kbps | 57 | 3.805 | 0.908 | 1.994 | 0.893 | 3.631 | 0.869 | 2.308 | 0.925 |
| FreeCodec-v2 | 16 kHz | 0.45 kbps | 57 | 3.921 | 0.900 | 2.190 | 0.846 | 3.984 | 0.893 | 2.230 | 0.896 |
| w/o content loss | 16 kHz | 0.45 kbps | 57 | 3.578 | 0.892 | 2.175 | 0.848 | 3.571 | 0.885 | 2.223 | 0.904 |
III-B Evaluation
We evaluate FreeCodec from two aspects. 1) Reconstruction quality: we evaluate on VCTK [30] and the test-clean subset of LibriSpeech. For VCTK, we randomly select 2911 utterances from 8 speakers for testing; for LibriSpeech, we use the 2620 utterances of the test-clean subset. All audio samples are downsampled to 16 kHz. 2) Disentanglement ability: we evaluate on a voice conversion benchmark. We randomly select 200 utterances from the LibriSpeech test-clean subset as source speech and 6 speakers from VCTK as target speakers. All models are evaluated in the LibriSpeech test-clean to VCTK scenario.
Subjective Evaluation. We follow the established MUSHRA methodology [34] to evaluate the subjective quality of our baselines and FreeCodec-v2. A group of fifteen listeners participates in the subjective tests, and fifteen utterances are randomly selected from our test sets for evaluation. In addition, we adopt Speex [35] at 4 kbps as the low anchor.
Objective Evaluation. For the objective evaluation of reconstruction, we employ an automatic mean opinion score prediction system (UTMOS) [36], the short-time objective intelligibility (STOI) [38], WARP-Q [37], and the speaker embedding cosine similarity (SECS, computed with https://github.com/resemble-ai/Resemblyzer) to evaluate overall speech quality. In addition, we use the word error rate (WER), character error rate (CER), and F0-PCC for the objective evaluation of voice conversion. The WER and CER between the source and converted speech are calculated with an ASR model (https://huggingface.co/openai/whisper-large); F0-PCC is the Pearson correlation coefficient of F0 between the source and converted speech.
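For reference, SECS can be computed as in the following sketch with the Resemblyzer speaker encoder; the file paths are illustrative.

```python
# Hedged sketch of the SECS metric: cosine similarity between Resemblyzer
# speaker embeddings of the reference and the decoded/converted utterance.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def secs(ref_path, test_path):
    # preprocess_wav loads the file, resamples to 16 kHz, and trims silence.
    ref = encoder.embed_utterance(preprocess_wav(ref_path))
    test = encoder.embed_utterance(preprocess_wav(test_path))
    # Embeddings are L2-normalized, so the dot product is cosine similarity.
    return float(np.dot(ref, test))

print(secs("reference_16khz.wav", "decoded_16khz.wav"))
```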
TABLE II: Voice conversion results (LibriSpeech test-clean to VCTK).

| Method | WER | CER | F0 PCC | SECS |
|---|---|---|---|---|
| FreeCodec-v3 | 8.37 | 6.14 | 0.702 | 0.847 |
| YourTTS | 9.20 | 6.92 | 0.682 | 0.815 |
| Wav2vec-vc | 13.23 | 9.20 | -0.037 | 0.826 |
| VQMIVC | 56.58 | 39.21 | 0.611 | 0.650 |

IV Results
IV-A Reconstruction Quality
Table I summarizes the results of the objective reconstruction experiments. FreeCodec-v1 performs best on almost all objective metrics across both test sets. In particular, in the out-of-domain setting, our proposed method achieves the best reconstruction performance while using only approximately 57 tokens per second. Compared to FreeCodec-v2, FreeCodec-v1 is better, especially in speaker similarity, which shows that the continuous global representation is more effective in reconstruction scenarios. Although its STOI and SECS are slightly lower than those of DAC at 1 kbps, FreeCodec-v2 attains better objective speech quality according to UTMOS and WARP-Q. The subjective evaluation shows the same trend, as illustrated in Fig. 2. In addition, FreeCodec-v2 achieves higher quality than Lyra-v2 at 3.2 kbps, EnCodec at 3 kbps, and TiCodec at 1 kbps. This demonstrates that the more fine-grained disentanglement framework yields higher reconstruction quality at a lower bitrate.
Furthermore, we conduct an ablation study to validate the effect of the content loss on the content encoder. Removing the content loss causes a performance drop in all objective metrics, especially UTMOS and STOI.
IV-B Disentanglement Ability
In this section, we evaluate the disentanglement ability through the voice conversion experiments. FreeCodec-v3 performs voice conversion by using the speaker information from the target speech. As shown in Table II, FreeCodec-v3 achieves lower WER and CER than all baseline models, especially the text-based ones. Meanwhile, the F0 PCC and speaker similarity of FreeCodec-v3 are also the highest. This indicates that our proposed method achieves superior disentanglement of the intrinsic attributes of human speech.
V Conclusion
In this paper, we propose a more fine-grained disentanglement framework that factorizes speech into its intrinsic attributes in a self-supervised manner. We show that this framework can be used in both reconstruction and disentanglement scenarios with different training strategies. Compared to existing methods, we use fewer tokens and lower bandwidth to achieve high-quality reconstruction. Our experiments show a significant improvement over existing methods, highlighting the effectiveness of our approach in both reconstruction quality and disentanglement ability.
References
- [1] D Rowe, “Codec 2-open source speech coding at 2400 bits/s and below,” in TAPR and ARRL 30th Digital Communications Conference, 2011, pp. 80–84.
- [2] Lynn M Supplee, Ronald P Cohn, et al., “MELP: The new federal standard at 2400 bps,” in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1997, vol. 2, pp. 1591–1594.
- [3] Neil Zeghidour, Alejandro Luebs, et al., “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2022.
- [4] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
- [5] Xue Jiang, Xiulian Peng, et al., “Latent-domain predictive neural speech coding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- [6] Youqiang Zheng, Weiping Tu, et al., “Supercodec: A neural speech codec with selective back-projection network,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 566–570.
- [7] Linping Xu, Jiawei Jiang, Dejun Zhang, et al., “An Intra-BRNN and GB-RVQ Based END-TO-END Neural Audio Codec,” in Proc. INTERSPEECH 2023, 2023, pp. 800–803.
- [8] Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Zehan Wang, Ruiqi Li, et al., “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling,” arXiv preprint arXiv:2408.16532, 2024.
- [9] Aaron Van Den Oord, Oriol Vinyals, et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
- [10] Dongchao Yang, Songxiang Liu, et al., “Hifi-codec: Group-residual vector quantization for high fidelity audio codec,” arXiv preprint arXiv:2305.02765, 2023.
- [11] Rithesh Kumar, Prem Seetharaman, et al., “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [12] Youqiang Zheng, Weiping Tu, et al., “Srcodec: Split-residual vector quantization for neural speech codec,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 451–455.
- [13] Adam Polyak, Yossi Adi, Jade Copet, et al., “Speech resynthesis from discrete disentangled self-supervised representations,” arXiv preprint arXiv:2104.00355, 2021.
- [14] Ahmed Omran, Neil Zeghidour, et al., “Disentangling speech from surroundings with neural embeddings,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [15] Yong Ren, Tao Wang, Jiangyan Yi, et al., “Fewer-token neural speech codec with time-invariant codes,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12737–12741.
- [16] Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu, “Speechtokenizer: Unified speech tokenizer for speech language models,” in The Twelfth International Conference on Learning Representations, 2024.
- [17] Hanzhao Li, Liumeng Xue, et al., “Single-codec: Single-codebook speech codec towards high-performance speech generation,” arXiv preprint arXiv:2406.07422, 2024.
- [18] Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, and Mark D Plumbley, “Semanticodec: An ultra low bitrate semantic audio codec for general sound,” arXiv preprint arXiv:2405.00233, 2024.
- [19] Ziyue Jiang, Yi Ren, et al., “Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias,” arXiv preprint arXiv:2306.03509, 2023.
- [20] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” arXiv preprint arXiv:2005.07143, 2020.
- [21] Yi Ren, Ming Lei, Zhiying Huang, et al., “Prosospeech: Enhancing prosody with quantized vector pre-training in text-to-speech,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7577–7581.
- [22] Ziyue Jiang, Jinglin Liu, Yi Ren, et al., “Mega-TTS 2: Boosting prompting mechanisms for zero-shot speech synthesis,” in The Twelfth International Conference on Learning Representations, 2024.
- [23] Zeqian Ju, Yuancheng Wang, et al., “Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” in Forty-first International Conference on Machine Learning, 2024.
- [24] Yu Pan, Lei Ma, and Jianjun Zhao, “Promptcodec: High-fidelity neural speech codec using disentangled representation learning based adaptive feature-aware prompt encoders,” arXiv preprint arXiv:2404.02702, 2024.
- [25] Ashish Vaswani, Noam Shazeer, et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [26] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11976–11986.
- [27] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- [28] Jingyi Li, Weiping Tu, and Li Xiao, “Freevc: Towards high-quality text-free one-shot voice conversion,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [29] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210.
- [30] Junichi Yamagishi, Christophe Veaux, et al., “Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit (version 0.92),” 2019.
- [31] Disong Wang, Liqun Deng, et al., “Vqmivc: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion,” arXiv preprint arXiv:2106.10132, 2021.
- [32] Edresson Casanova, Julian Weber, et al., “Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone,” in International Conference on Machine Learning. PMLR, 2022, pp. 2709–2720.
- [33] Jaemin Lim and Kiyeon Kim, “Wav2vec-vc: Voice conversion via hidden representations of wav2vec 2.0,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10326–10330.
- [34] ITU-R BS.1534-1, “Method for the subjective assessment of intermediate quality levels of coding systems (MUSHRA),” International Telecommunication Union, 2003.
- [35] Jean-Marc Valin, “Speex: A free codec for free speech,” arXiv preprint arXiv:1602.08668, 2016.
- [36] Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari, “Utmos: Utokyo-sarulab system for voicemos challenge 2022,” arXiv preprint arXiv:2204.02152, 2022.
- [37] Wissam A Jassim, Jan Skoglund, et al., “Warp-q: Quality prediction for generative neural speech codecs,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 401–405.
- [38] Cees H Taal, Richard C Hendriks, et al., “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2010, pp. 4214–4217.