ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
Abstract
In recent years, diffusion models have become increasingly popular for image and speech generation. However, directly generating music waveforms from free-form text prompts is still under-explored. In this paper, we propose a text-to-waveform music generation model based on diffusion models that can receive arbitrary texts. We incorporate the free-form textual prompt as the condition to guide the waveform generation process of the diffusion model. To address the lack of such text-music parallel data, we collect a dataset of text-music pairs from the Internet with weak supervision. In addition, we compare two prompt formats for the conditioning texts (music tags and free-form texts) and show that free-form texts lead to better text-music relevance. We further demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance. Generated cases are available at https://reurl.cc/94W4yO.
1 Introduction
Music is a high-level human art form whose combination of harmony, melody, and rhythm can lift people's spirits and adjust their mood. Its rich content can express one's feelings or emotions and even tell a story. In recent years, music generation has become an active research topic alongside the development of deep learning techniques.
Some works DBLP:journals/corr/abs-2211-11216 study symbolic music generation, which learns to predict a sequence composed of notes, pitch, and dynamic attributes. The generated symbolic music does not contain performance attributes; thus, post-processing is usually needed to synthesize the audio of the music. On the other hand, some works DBLP:journals/corr/abs-2208-08706 study generating music directly as audio or waveforms. This requires no extra synthesis step; however, the generated audio signals are usually more difficult to control to achieve satisfactory performance attributes.
Besides works on unconditional music generation, there have been explorations of conditional music generation DBLP:journals/corr/abs-2208-08706; DBLP:journals/corr/abs-2211-11248, which aims to meet application requirements in scenarios such as automatic video soundtrack creation and assisted music creation with specific genres or features. Notably, generative models can leverage information from various modalities, such as text and image, to create relevant outputs for conditional generation. However, the problem of generating waveform music based on free-form text has yet to be well researched. While there have been several studies on text-conditional music generation, such as DBLP:journals/corr/abs-2211-11216, BUTTER zhang2020butter, and Mubert (https://github.com/MubertAI/Mubert-Text-to-Music), they are not able to directly generate musical audio based on free-form text.
To address the limitations of prior research, we propose ERNIE-Music, the first attempt at free-form text-to-music generation in the waveform domain using diffusion models. To solve the problem of lacking a large-scale parallel text-music dataset, we collect the data from the Internet by utilizing the "comment voting" mechanism. We apply conditional diffusion models to process musical waveforms to model the generative process and study which text format better helps the model learn text-music relevance. To conclude, the contributions of this paper are:
• We propose a music generation model that receives free-form text as the condition and generates waveform music based on the diffusion model;
• We collect free-form text-music parallel data from the Internet by utilizing the comment voting mechanism;
• We study and compare two forms of text for conditioning the generative model and show that using free-form text obtains better text-music relevance;
• The results show that our model can generate diverse and high-quality music with higher text-music relevance, outperforming other methods by a large margin.
2 Related Work
Controllable Music Generation
The central question in controllable music generation is how to impose constraints on the generated music. VQ-CPC hadjeres2020vector learns local music features that do not contain temporal information. DBLP:journals/corr/abs-2208-08706 uses tempo information as a condition to generate "techno" genre music. BUTTER zhang2020butter uses a natural language representation of the music key, meter, style, and other relevant attributes to control the generation of music. DBLP:journals/corr/abs-2211-11216 further explores the effect of different pre-trained models in text-to-music generation. Mubert first computes encoded representations of the input natural language and the music tags using Sentence-BERT reimers2019sentence; it then selects the music tags closest to the input and generates a combination of sounds from the set of sounds corresponding to these tags, where all sounds are created by musicians and sound designers.
Diffusion Models
Diffusion models DBLP:conf/icml/Sohl-DicksteinW15; DBLP:conf/nips/HoJA20 are latent variable models motivated by non-equilibrium thermodynamics. They gradually destroy the structure of the original data distribution through an iterative forward diffusion process and then learn the reverse, reconstructing the original data through a finite iterative denoising process. In recent years, diffusion models have gained popularity in many areas, such as image generation DBLP:conf/icml/NicholDRSMMSC22; DBLP:conf/nips/DhariwalN21; ramesh2022hierarchical and audio generation DBLP:conf/iclr/ChenZZWNC21; Kreuk2022AudioGenTG. Our work is closely related to diffusion approaches for text-to-audio generation DBLP:conf/iclr/ChenZZWNC21; Kreuk2022AudioGenTG, which also generate audio waveforms but address different tasks: previous works focus on speech generation, while this work employs diffusion models to synthesize music waveforms given arbitrary textual prompts.
3 Method
In this section, we first introduce the background of diffusion models and then describe the text-conditional diffusion model we use, including the model architecture and the training objectives.

3.1 Unconditional Diffusion Model
Diffusion models DBLP:conf/icml/Sohl-DicksteinW15; DBLP:conf/nips/HoJA20 consist of a forward process, which iteratively adds noise to a data sample, and a reverse process, which denoises a sample over multiple steps to generate a sample that conforms to the real data distribution. We adopt the diffusion model defined in continuous time DBLP:journals/corr/abs-2107-00630; DBLP:journals/corr/abs-1905-09883; DBLP:conf/iclr/ChenZZWNC21; DBLP:conf/iclr/0011SKKEP21; DBLP:conf/iclr/SalimansH22.
Consider a data sample $x$ from distribution $p(x)$; diffusion models leverage latent variables $z_t$, where $t$ ranges from $0$ to $1$. The log signal-to-noise ratio is defined as $\lambda_t = \log(\alpha_t^2 / \sigma_t^2)$, where $\alpha_t$ and $\sigma_t$ denote the noise schedule.

For the forward process or diffusion process, Gaussian noise is added to the sample iteratively, which satisfies a Markov chain:

$$q(z_t \mid x) = \mathcal{N}(z_t; \alpha_t x, \sigma_t^2 I), \qquad q(z_t \mid z_s) = \mathcal{N}\big(z_t; (\alpha_t/\alpha_s) z_s, \sigma_{t|s}^2 I\big),$$

where $0 \le s < t \le 1$ and $\sigma_{t|s}^2 = (1 - e^{\lambda_t - \lambda_s})\sigma_t^2$.

In the reverse process, a function approximation with parameters $\theta$ (denoted as $\hat{x}_\theta$) estimates the denoising procedure:

$$p_\theta(z_s \mid z_t) = \mathcal{N}\big(z_s;\ \tilde{\mu}_{s|t}(z_t, \hat{x}_\theta(z_t)),\ \tilde{\sigma}_{s|t}^2 I\big),$$

where $z_1 \sim \mathcal{N}(0, I)$. Starting from $z_1$, by applying the denoising procedure to the latent variables $z_t$, a sample $x$ can be generated at the end of the reverse process. To train the denoising model $\hat{x}_\theta$, without loss of generality, the weighted mean squared error loss is adopted:

$$L_\theta = \mathbb{E}_{\epsilon, t}\big[\, w(\lambda_t)\, \|\hat{x}_\theta(z_t) - x\|_2^2 \,\big], \qquad z_t = \alpha_t x + \sigma_t \epsilon, \tag{1}$$

where $w(\lambda_t)$ denotes the weighting function and $\epsilon \sim \mathcal{N}(0, I)$ denotes the noise.
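As a concrete illustration of the forward process, here is a minimal PyTorch sketch of noising a batch of waveforms under the cosine schedule used later in Section 3.2.2. It is our own sketch rather than the authors' code, and the tensor shapes assume stereo waveform clips.

```python
import math
import torch

def noise_schedule(t):
    """Cosine schedule: alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2) (variance preserving)."""
    return torch.cos(math.pi * t / 2), torch.sin(math.pi * t / 2)

def diffuse(x, t):
    """Forward process q(z_t | x): z_t = alpha_t * x + sigma_t * eps, with eps ~ N(0, I)."""
    alpha, sigma = noise_schedule(t)
    alpha = alpha.view(-1, 1, 1)   # broadcast the schedule over (batch, channels, samples)
    sigma = sigma.view(-1, 1, 1)
    eps = torch.randn_like(x)
    return alpha * x + sigma * eps, eps

# Example: a batch of 4 stereo waveform clips of 327,680 samples (about 20 s at 16 kHz)
x = torch.randn(4, 2, 327680)
t = torch.rand(4)                  # continuous time t ~ U(0, 1)
z_t, eps = diffuse(x, t)
```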
3.2 Conditional Diffusion Model
Many works successfully implement generative models with conditional settings DBLP:journals/corr/MirzaO14; DBLP:conf/nips/SohnLY15; DBLP:conf/cvpr/RombachBLEO22. Conditional diffusion models approximate the distribution $p(x \mid c)$ instead of $p(x)$ by modeling the denoising process $p_\theta(z_s \mid z_t, c)$, where $c$ denotes the condition. The condition $c$ can be any type of modality, such as image, text, or audio. Specifically, in our text-to-music generation scenario, $c$ is a text prompt that guides the model to generate related music. We discuss the details of modeling the conditional diffusion model in the following sections.
3.2.1 Model Architecture
For text-to-music generation, the condition of the diffusion process is text. As shown in Figure 1, our overall model architecture contains a conditional music diffusion model, which models the predicted velocity $\hat{v}_\theta$ DBLP:conf/iclr/SalimansH22, and a text encoder $f_{\text{enc}}$, which maps text tokens $w = [w_1, \ldots, w_n]$ with length $n$ into a sequence of vector representations $E = [e_{\text{cls}}, e_1, \ldots, e_n]$ with dimension $d_w$, where $E \in \mathbb{R}^{(n+1) \times d_w}$ and $e_{\text{cls}}$ is the classification representation of the text.

The inputs of the music diffusion model are the latent variable $z_t \in \mathbb{R}^{c \times s}$, the timestep $t$ (which is then transformed into the embedding $e_t \in \mathbb{R}^{s \times d_t}$), and the representation of the text sequence $E$, where $c$ denotes the number of channels, $s$ denotes the sample size, and $d_t$ denotes the feature size of the timestep embedding. The output is the estimated velocity $\hat{v}_\theta$.
Inspired by previous works on latent diffusion models DBLP:conf/icml/NicholDRSMMSC22; DBLP:conf/cvpr/RombachBLEO22; DBLP:conf/nips/DhariwalN21, we adopt the UNet architecture DBLP:conf/miccai/RonnebergerFB15, whose key components are stacked convolutional blocks and self-attention blocks DBLP:conf/nips/VaswaniSPUJGKP17. Generative models can estimate the conditional distribution $p(x \mid c)$, and the conditional information can be fused into the generative model in many ways DBLP:conf/nips/SohnLY15.

Our diffusion network predicts the latent velocity at a randomly sampled timestep $t$ given the noised latent $z_t$ and a textual input as the condition. To introduce the condition into the diffusion process, we perform a fusing operation $\mathcal{F}$ on the timestep embedding $e_t$ and the text classification representation $e_{\text{cls}}$ to obtain the text-aware timestep embedding $e'_t = \mathcal{F}(e_t, e_{\text{cls}})$. It is then concatenated with $z_t$ to form the input of the UNet. We omit the operations on the timestep embedding in Figure 1 for simplicity. In Section 4.6, we compare the performance of different implementations of the fusing operation $\mathcal{F}$.
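Below is a minimal sketch of the two candidate fusing operations compared in Section 4.6 (element-wise summation and concatenation). The module name, the projection layers, and the dimension arguments are our own assumptions for illustration.

```python
import torch
import torch.nn as nn

class TimestepTextFusion(nn.Module):
    """Fuse the timestep embedding e_t with the text [CLS] representation e_cls.
    mode='sum' adds the two; mode='concat' concatenates them and projects back."""
    def __init__(self, d_t, d_text, mode="sum"):
        super().__init__()
        self.mode = mode
        self.text_proj = nn.Linear(d_text, d_t)            # map e_cls into the timestep space
        if mode == "concat":
            self.out_proj = nn.Linear(2 * d_t, d_t)

    def forward(self, e_t, e_cls):
        c = self.text_proj(e_cls)
        if self.mode == "sum":
            return e_t + c                                  # element-wise summation
        return self.out_proj(torch.cat([e_t, c], dim=-1))   # concatenation + projection
```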
Moreover, we introduce the conditional representation $E$ into the self-attention blocks DBLP:conf/nips/VaswaniSPUJGKP17, which model the global information of the music signals. In the self-attention blocks, let the intermediate representation of $z_t$ be $H$ and the text representation be $E$; the output is computed as follows:

$$\mathrm{CondAttn}(H, E) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O,$$
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i,$$
$$Q_i = H W_i^Q, \qquad K_i = [H W_i^K; E U_i^K], \qquad V_i = [H W_i^V; E U_i^V],$$

where $W_i^Q$, $W_i^K$, $W_i^V$, $U_i^K$, $U_i^V$, and $W^O$ are parameter matrices, $h$ denotes the number of heads, $[\cdot\,;\cdot]$ denotes concatenation along the sequence dimension, and $\mathrm{CondAttn}$ denotes the conditional self-attention operation.
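One plausible realization of this conditional self-attention, with the text representations appended to the keys and values, is sketched below in PyTorch. It is a simplified illustration under our own naming and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class ConditionalSelfAttention(nn.Module):
    """Self-attention over music features H, with the text sequence E appended
    to the keys and values so every position can also attend to the text."""
    def __init__(self, d_model, d_text, n_heads):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.k_text = nn.Linear(d_text, d_model)   # project text into the model space
        self.v_text = nn.Linear(d_text, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.n_heads = n_heads

    def forward(self, H, E):
        B, L, D = H.shape
        h, d_k = self.n_heads, D // self.n_heads

        def split(x):  # (B, T, D) -> (B, h, T, d_k)
            return x.view(B, -1, h, d_k).transpose(1, 2)

        q = split(self.q(H))
        k = split(torch.cat([self.k(H), self.k_text(E)], dim=1))
        v = split(torch.cat([self.v(H), self.v_text(E)], dim=1))
        attn = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, D)
        return self.out(out)
```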
3.2.2 Training
Following DBLP:conf/iclr/SalimansH22, we set the weighting function $w(\lambda_t)$ in Equation (1) to the "SNR+1" weighting for a more stable denoising process.
Specifically, for the noise schedule $\alpha_t$ and $\sigma_t$, we adopt the cosine schedule DBLP:conf/icml/NicholD21 $\alpha_t = \cos(\pi t / 2)$, $\sigma_t = \sin(\pi t / 2)$, and the variance-preserving diffusion process satisfies $\alpha_t^2 + \sigma_t^2 = 1$.

We denote the function approximation as $\hat{v}_\theta(z_t, c)$, where $c$ denotes the condition. The prediction target of $\hat{v}_\theta$ is the velocity $v_t = \alpha_t \epsilon - \sigma_t x$, which gives $\hat{x} = \alpha_t z_t - \sigma_t \hat{v}_\theta(z_t, c)$. Finally, our training objective is:

$$L_\theta = \mathbb{E}_{\epsilon, t}\big[\, \|\hat{v}_\theta(z_t, c) - v_t\|_2^2 \,\big].$$
Algorithm 1 (in Appendix) displays the complete training process with the diffusion objective proposed by DBLP:conf/iclr/SalimansH22.
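For clarity, a condensed PyTorch sketch of one training step with the v-prediction objective follows. Here `model` and `text_encoder` are placeholders for the conditional UNet and the ERNIE-M encoder, and details such as EMA and optimizer state are omitted.

```python
import math
import torch
import torch.nn.functional as F

def training_step(model, text_encoder, x, texts):
    """One step of the v objective: predict v_t = alpha_t*eps - sigma_t*x
    from the noised latent z_t and the text condition."""
    B = x.shape[0]
    t = torch.rand(B, device=x.device)               # continuous timestep
    alpha = torch.cos(math.pi * t / 2).view(B, 1, 1)
    sigma = torch.sin(math.pi * t / 2).view(B, 1, 1)
    eps = torch.randn_like(x)
    z_t = alpha * x + sigma * eps                    # forward diffusion
    v_target = alpha * eps - sigma * x               # velocity target
    cond = text_encoder(texts)                       # sequence of text representations
    v_pred = model(z_t, t, cond)                     # conditional UNet prediction
    return F.mse_loss(v_pred, v_target)
```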
4 Experiments
Table 1: Statistics of the collected Web Music with Text dataset.

| | Train | Test |
|---|---|---|
| Num. of Data Samples | 3890 | 204 |
| Avg. Text Length (Tokens) | 63.23 | 64.45 |
| Music Sample Rate | 16000 | 16000 |
| Music Sample Size | 327680 | 327680 |
| Music Duration | 20 seconds | 20 seconds |
4.1 Implementation Details
Following previous works DBLP:conf/cvpr/RombachBLEO22; DBLP:conf/icml/NicholDRSMMSC22; DBLP:conf/nips/HoJA20, we use the UNet architecture DBLP:conf/miccai/RonnebergerFB15 for the diffusion model. The UNet uses 14 layers of stacked convolutional blocks and attention blocks for the downsample and upsample modules, with skip connections between layers of the same hidden size. The input/output channel size is 512 for the first ten layers, followed by two layers of 256 and two layers of 128. We employ attention at the 16x16, 8x8, and 4x4 resolutions. The sample size and sample rate of the waveform are 327,680 and 16,000, and the channel size is 2. The timestep embedding layer contains trainable parameters of shape 8x1; it is concatenated with the noise schedule to obtain the embedding, which is then expanded to the sample size to obtain $e_t$. For the text encoder $f_{\text{enc}}$, we use ERNIE-M ouyang-etal-2021-ernie, which can process multi-lingual texts.
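For reference, the layer layout described above can be summarized as a small configuration sketch; the dictionary keys are illustrative names of ours, not fields from the authors' codebase.

```python
# Illustrative configuration matching the description above (names are ours).
unet_config = {
    "num_layers": 14,
    "channels": [512] * 10 + [256, 256, 128, 128],  # 512 for the first ten layers, then 256s and 128s
    "attention_resolutions": [16, 8, 4],             # attention at 16x16, 8x8, 4x4
    "in_channels": 2,                                # stereo waveform
    "sample_size": 327_680,                          # about 20 s at 16 kHz
    "sample_rate": 16_000,
}
```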
4.2 Dataset
There are few available parallel training data pairs of free-form text and music. To solve this problem, we collect such data from the Internet; the text is mainly in Chinese.
We use the Internet's "comment voting" mechanism to collect such data. On music streaming platforms, some users enjoy writing comments on the music that interests them. If other users think these comments are of high quality, they click the "upvote" button, and comments with a high count of upvotes can be selected as "popular comments". In our observation, popular comments are generally of relatively high quality and usually contain useful music-related information such as musical instruments, genres, and expressed human moods. Based on these rules, we collect a large number of comment-text and music parallel data pairs from the Internet to train the text-conditional music generation model. The statistics of our collected Web Music with Text dataset and examples are listed in Tables 1 and 7.
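A toy sketch of this weak-supervision filter is shown below. The comment fields and the upvote threshold are hypothetical, since the paper does not state a concrete cutoff.

```python
def select_popular_comments(comments, min_upvotes=1000):
    """Keep 'popular' comments as weak supervision for a music clip.
    The upvote threshold is illustrative, not a value from the paper."""
    return [
        c["text"] for c in comments
        if c.get("upvotes", 0) >= min_upvotes
    ]
```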
4.3 Evaluation Metric
For the text-to-music generation task, we evaluate performance in two aspects: text-music relevance and music quality. We use the compared methods or models to generate music based on texts from the test set and rely on human evaluation. We hire 10 evaluators to score the music generated by each compared model; the identity of the model corresponding to each generated piece is invisible to the evaluators. For each generated piece, we average the scores over the 10 evaluators, and we then average the scores of the same model over the entire test set to obtain the final evaluation result for that model.
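The aggregation itself is straightforward; a small Python sketch with an illustrative data layout is:

```python
from collections import defaultdict
from statistics import mean

def aggregate_scores(ratings):
    """ratings: iterable of (model, sample_id, evaluator_score) tuples.
    First average the evaluator scores per generated piece,
    then average over all test samples for each model."""
    per_piece = defaultdict(list)
    for model, sample_id, score in ratings:
        per_piece[(model, sample_id)].append(score)
    per_model = defaultdict(list)
    for (model, _), scores in per_piece.items():
        per_model[model].append(mean(scores))
    return {model: mean(vals) for model, vals in per_model.items()}
```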
4.4 Compared Methods
The methods for comparison are Text-to-Symbolic Music (denoted as TSM) DBLP:journals/corr/abs-2211-11216, Mubert, and Musika DBLP:journals/corr/abs-2208-08706. The generated music from Mubert is actually created by human musicians, and TSM only generates music scores, which need to be synthesized into audio by additional tools, so music quality is not comparable among Mubert, TSM, and our model. Thus, we only compare text-music relevance between them and our model. To synthesize the music audio from the symbolic score generated by TSM, we first adopt abcMIDI (https://github.com/sshlien/abcmidi) to convert the abc file output by TSM into a MIDI file and then use FluidSynth (https://github.com/FluidSynth/fluidsynth) to synthesize the final music audio. For music quality, we compare our model's performance with Musika, a recent well-known work that also directly generates waveform music.
Table 2: Human evaluation of text-music relevance.

| Method | Score | Top Rate | Bottom Rate |
|---|---|---|---|
| TSM DBLP:journals/corr/abs-2211-11216 | 2.05 | 12% | 27% |
| Mubert | 1.85 | 37% | 32% |
| Our model | 2.43 | 55% | 12% |

Table 3: Human evaluation of music quality.

| Method | Score | Top Rate | Bottom Rate |
|---|---|---|---|
| Musika DBLP:journals/corr/abs-2208-08706 | 3.03 | 5% | 13% |
| Our model | 3.63 | 15% | 2% |
4.5 Results
Tables 2 and 3 show the evaluation results for text-music relevance and music quality. For the text-music relevance evaluation, we use a ranking score of 3 (best), 2, or 1 to denote which of the three models has the best relevance for a given piece of text. For music quality, we use a five-level score of 5 (best), 4, 3, 2, or 1, which indicates to what extent the evaluator prefers the melody and coherence of the music. The top rate is the probability that the music obtains the highest score, and the bottom rate is the probability that it obtains the lowest score. The results indicate that our model generates music with better quality and text-music relevance, outperforming related works by a large margin.
[Figure 2: Waveforms and spectrograms of two generated pieces: a fast-paced guitar piece and a slower, more soothing piano piece.]
4.6 Analysis
Diversity
The music generated by our model has high diversity. In terms of melody, our model can generate music with a softer and more soothing rhythm as well as more passionate and fast-paced music. In terms of emotional expression, some music sounds sad, while other music is very festive and cheerful. In terms of musical instruments, it can generate music composed for various instruments, including piano, violin, erhu, and guitar. We select two examples with apparent differences and analyze them based on the visualization results. As shown in the waveforms in Figure 2, the fast-paced guitar piece has denser sound waves, while the piano piece has a slower, more soothing rhythm. Moreover, the spectrograms show that the guitar piece contains dense high- and low-frequency sounds, while the piano piece mainly lies in the bass range.
[Figure 3: Test-set MSE over the course of training for the two text condition fusing operations (concatenation vs. element-wise summation).]
Comparison of Different Text Condition Fusing Operations
As introduced in Section 3.2.1, we compare two implementations of the fusing operation $\mathcal{F}$, namely concatenation and element-wise summation. To evaluate their effect, we compare the performance on the test set as training progresses. Every 5 training steps, we use the model checkpoint to generate pieces of music based on the texts in the test set and calculate the mean squared error (MSE) between the generated music and the gold music from the test set. The visualization results are shown in Figure 3, which indicates no apparent difference between the two fusing operations. For simplicity, we finally adopt the element-wise summation.
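A sketch of this checkpoint evaluation loop is given below; `model.sample` is an assumed sampling helper and batching is simplified.

```python
import torch

@torch.no_grad()
def checkpoint_mse(model, text_encoder, test_texts, gold_music):
    """Generate music for every test text with the current checkpoint and
    report the mean squared error against the reference waveforms."""
    errors = []
    for text, gold in zip(test_texts, gold_music):
        cond = text_encoder([text])
        generated = model.sample(cond)      # assumed sampling helper of the diffusion model
        errors.append(torch.mean((generated - gold) ** 2).item())
    return sum(errors) / len(errors)
```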
Comparison of Different Formats of Input Text
Our proposed method leverages free-form text to generate music. However, considering that a more widely used approach in other works is to generate music based on a set of pre-defined music tags representing a specific piece's features zhang2020butter, we compare the two approaches to see which yields better text-music relevance of the generated music.

(1) End-to-End Text Conditioning. Suppose the training data consists of multiple text-music pairs. The texts are free-form, describing some scenario or emotion, or just a few words about music features. We adopt the straightforward way to process the texts: we input them into the text encoder to obtain the text representations. This relies on the naturally high correlation of the text-music pairs, and the conditional diffusion model dynamically learns to capture the critical information from the text during training.

(2) Music Tag Conditioning. Using short and precise music tags as the text condition may make it easier for the model to learn the mapping between text and the corresponding music. We analyze the text data from the training set and distill critical information from the texts to obtain music tags; examples are shown in Table 5. The key features of the music described in a long piece of text are limited and can be extracted as music tags.

We randomly select 50 samples from the test set for manual evaluation. Table 4 shows the evaluation results of the two conditioning methods, which indicate that our free-form text-based music generation method obtains better text-music relevance than using pre-defined music tags. The main reason might be that the hand-crafted music tag selection rules introduce noise and lose some useful information from the original text. Thus, it is better to use End-to-End Text Conditioning and let the model learn to capture useful information dynamically.
Table 4: Human evaluation of text-music relevance for the two conditioning methods.

| Method | Score | Top Rate | Bottom Rate |
|---|---|---|---|
| Music Tag Conditioning | 1.7 | 22% | 52% |
| End-to-End Text Conditioning | 2.3 | 40% | 10% |
5 Conclusion
In this paper, we propose ERNIE-Music, a music generation model that generates music audio based on free-form text. To solve the problem of lacking such parallel text and music data, we collect music from the Internet paired with comment texts describing various music features. To analyze the effect of text format on learning text-music relevance, we mine music tags from the entire text collection and compare the two forms of text conditions. The results show that our free-form text-based conditional generation model creates diverse and coherent music and outperforms related works in music quality and text-music relevance.
Limitations
While our model successfully generates coherent and pleasant music, it is important to acknowledge several limitations that can be addressed in future research. The primary limitation is the fixed and relatively short length of the generated music. Due to computational resource constraints, we were unable to train the model on longer sequences. Altering the length during the inference phase can negatively impact performance, which is an area for further investigation.
Another limitation is the relatively slow speed of the generation process. The iterative nature of the generation procedure contributes to this slower speed. Exploring techniques to optimize the generation process and reduce computational overhead could enhance the efficiency of music generation in the future.
Additionally, it is important to note that our current model is designed to generate instrumental music and does not incorporate human voice. This limitation stems from the training data used, which primarily consists of instrumental music. Expanding the training dataset to include vocal music could enable the generation of music with human voice, offering a more comprehensive music generation system.
Appendix A Dataset
Examples of our collected dataset can be seen in Table 7.
Table 5: Examples of free-form texts and the music tags distilled from them.

| Text | Tags |
|---|---|
| 聆听世界著名的钢琴曲简直是一种身心享受，我非常喜欢 (Listening to world-famous piano music is simply a kind of physical and mental enjoyment; I like it very much) | 钢琴 (piano) |
| 钢琴旋律的弦音，轻轻地、温柔地倾诉心中的遐想、心中的爱恋 (The strings of the piano melody gently and tenderly express the reverie and love in the heart) | 钢琴, 轻轻, 温柔, 爱 (piano, gentle, tender, love) |
| 提琴与钢琴合鸣的方式，在惆怅中吐露出淡淡的温柔气息 (The ensemble of violin and piano reveals a touch of gentleness in melancholy) | 钢琴, 小提琴, 温柔, 惆怅 (piano, violin, gentle, melancholic) |
Table 6: Examples of adopted and abandoned terms from the TF-IDF tag mining.

| | Tags |
|---|---|
| Adopted | 希望, 生命, 钢琴, 小提琴, 孤独, 温柔, 幸福, 悲伤, 游戏, 电影 (hope, life, piano, violin, lonely, gentle, happiness, sad, game, movie) |
| Abandoned | 音乐, 喜欢, 感觉, 世界, 好听, 旋律, 永远, 音符, 演奏, 相信 (music, like, feeling, world, good-listening, melody, forever, note, play, believe) |
Table 7: Examples from the collected Web Music with Text dataset.

| Title | Musician | Text |
|---|---|---|
| 风的礼物 (Gift of the Wind) | 西村由纪江 (Yukie Nishimura) | 轻快的节奏，恰似都市丽人随风飘过的衣袂。放松的心情，片刻的愉快驱散的是工作的压力和紧张，沉浸其中吧，自己的心。 (The brisk rhythm is like the clothes of urban beauties drifting in the wind. A relaxed mood, a moment of pleasure, dispels the pressure and tension of work. Immerse yourself, your own heart, in it.) |
| 九龙水之悦 (Joy of the Kowloon Water) | 李志辉 (Zhihui Li) | 聆听[九龙水之悦]卸下所有的苦恼，卸下所有的沉重，卸下所有的忧伤，还心灵一份纯净，还人生一份简单。 (Listen to "The Joy of the Kowloon Water" to remove all the troubles, all the heaviness, and all the sorrows and restore the purity of the soul and the simplicity of life.) |
| 白云 (Nuvole Bianche) | 鲁多维科·伊诺 (Ludovico Einaudi) | 钢琴的更宁静，可大提琴的更多的是悠扬和深沉，也许是不同的演奏方式带来不同的音乐感受吧。 (The piano is more serene, but the cello is more melodious and deep. Perhaps different playing methods bring different musical feelings.) |
Appendix B Implementation Details
We train the model for 580,000 steps using the Adam optimizer with a learning rate of 4e-5 and a training batch size of 96. We keep an exponential moving average of the model weights with a decay rate of 0.995, except for the first 25 epochs.
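The EMA update is the standard one; a minimal PyTorch sketch (our own illustration) is:

```python
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.995):
    """Exponential moving average of model weights, used for the saved checkpoints."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```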
Appendix C Music Tags Extraction
To obtain the music tags, we use the TF-IDF model to mine terms with higher frequency and importance from the dataset. Given a set of texts $D$, the basic assumption is that the texts contain various words or phrases related to music features such as instruments and genres. We aim to mine a tag set $T$ from $D$. We assume two rules for a good music tag that represents typical music features: 1) a certain amount of different music can be described with the tag, so that the model can learn the "text(tag)-to-music" mapping without loss of diversity; 2) a tag is worthless if it appears in the descriptions of too many pieces of music. For example, almost every piece of music can be described as "good listening"; thus, this should not be adopted as a music tag. Based on these rules, we leverage the TF-IDF model to mine the music tags. Because the language of our dataset is Chinese, we use jieba (https://github.com/fxsjy/jieba) to cut the sentences into terms. For a term $w$, we compute statistics over the whole dataset to obtain the term frequency $\mathrm{tf}(w)$ and the inverse document frequency $\mathrm{idf}(w)$; the term score is then $s(w) = \mathrm{tf}(w) \cdot \mathrm{idf}(w)$. We sort all terms by $s(w)$ in descending order and manually select the 100 best music tags to obtain the ultimate music tag set $T$, which represents music features such as instruments, genres, and expressed emotions. Table 6 displays examples of the adopted and abandoned terms.
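A simplified sketch of this mining step is shown below. The tokenization filter and the exact TF-IDF variant are our assumptions, and the final tag set is still selected manually as described above.

```python
import math
from collections import Counter

import jieba

def mine_tag_candidates(texts, top_k=100):
    """Score terms by TF-IDF over the comment corpus and return the
    highest-scoring candidates for manual selection as music tags."""
    term_freq = Counter()   # total occurrences of each term
    doc_freq = Counter()    # number of comments containing each term
    for text in texts:
        seen = set()
        for term in jieba.cut(text):
            term = term.strip()
            if len(term) > 1:            # drop single characters and punctuation
                term_freq[term] += 1
                seen.add(term)
        doc_freq.update(seen)
    n_docs = len(texts)
    scores = {
        t: term_freq[t] * math.log(n_docs / (1 + doc_freq[t]))
        for t in term_freq
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```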
We use the mined music tags to condition the diffusion process. For a piece of music from the training data, we concatenate its corresponding music tags with a separator symbol “,” to obtain a music tag sequence as the conditioning text to train the model.