DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

Abstract

Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream tasks of IVAs with pre-trained audio models and text models. However, these models are pre-trained independently and usually on tasks different from target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio also requires high professional skills, making it difficult or even infeasible to joint pre-training. To address these painpoints, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signal input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective and a language-audio matching objective to align the audio and ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that while conceptually simple, DSCLAP significantly outperforms the baseline models in all metrics, showing great promise for domain-specific IVAs applications.

Index Terms— domain-specific pre-train, language-audio pre-train, intelligent voice assistants

1 Introduction

The ability to jointly understand both acoustic and semantic signals has become a critical ability for IVAs to interpret voice signals in the physical world [1, 2, 3]. A wide range of tasks based on real-life voice have been designed to test such ability, including Multimodal Device-directed Speech Detection (MDSD) [4, 5, 6] and Multimodal Conversational Intent Classification (MCIC) [7, 8]. The de facto paradigm for addressing these cross-modal tasks is to first extract acoustic features from pre-trained acoustic models [9, 10] and textual features from pre-trained language models [11, 12] and then fuse them for task-specific finetune.

Existing approaches following this paradigm have achieved sustained success due to the addition of pre-trained models, yet suffer from two main drawbacks: (1) Models are pre-trained solely on pure text or audio data, resulting in the extracted modal representations that are disconnected in the semantic space. (2) Pre-training tasks based on generic domain data may compromise the accuracy of downstream tasks in the target domain due to the disconnections between domains.

In this work, we propose to mitigate the above problems with a simple yet effective approach, in which we combine text and audio modalities from the target domain for domain-specific pre-training. Naïvely, we can directly refer to some popular vision-language pre-training frameworks ( $\mathit{e.g.,}$ CLIP [13] and ALIGN [14]) and extend them to the language-audio settings [15]. However, for many real-world IVA application domains, such as medical and in-vehicle, collecting and cleaning aligned language-audio pairs is time-consuming and requires specialized annotators due to the privacy of the data, leading us always to have insufficient pairs for pre-training. On the other hand, the majority of IVAs in the real world convert the received audio signal into text through Automatic Speech Recognition (ASR) for natural language understanding. We notice that the potential of these already available language-audio pairs (probably imperfect) has not been fully explored in pre-training tasks.

To this end, we propose an approximate approach to automatically contrast language-audio pairs for domain-specific multimodal pre-training, which we refer to as Domain-Specific Contrastive Language-Audio Pre-Training (DSCLAP). Specifically, we train the model to align the two modalities into a common semantic space by maximizing the mutual information between the raw audio and ASR transcriptions with a contrastive learning objective and a matching objective. This ASR-transcribing-then-training strategy greatly reduces the need for paired data, allowing economical end-to-end multimodal pre-training from raw audio data in the specific domain.

We evaluated the effectiveness of our DSCLAP approach in the in-vehicle interaction domain, for which we collected over 12,107 hours of in-vehicle interaction audio data to pre-train the DSCLAP model. Experimental results show that DSCLAP significantly outperforms the baseline model and achieves the best performance not only on the MDSD task but also on the more novel yet challenging MCIC task. Notably, to our best knowledge, DSCLAP is the first model that requires only audio modality input for language-audio pre-training, which we believe is a crucial step towards domain-specific multimodal pre-training.

2 Method

We propose a new framework that enables multimodal pre-training with only raw audio modality input. The overall architecture of the framework is shown in Fig. 1. Specifically, we first translate audio signals to text with an ASR system. Then, we encode these two modalities and fuse their representations with a cross-modal modeling layer. Finally, we train the model with a contrastive learning objective and a language-audio matching objective.

Refer to caption — Fig. 1: An illustration of the DSCLAP framework. In contrast to prior work [15] that leverages pre-prepared language-audio pairs for contrastive learning pretraining (red dashed arrows), our DSCLAP (black hard arrows) requires only raw audio inputs. Besides the standard InfoNCE loss, inspired by [16], we introduce a Language-Audio Matching (LAM) objective to achieve more effective contrastive learning.

2.1 Embeddings

For encoding audio frames we utilize a pre-trained Wav2vec2 whose architecture follows the work in [9]. Let the raw waveform be $\textbf{\footnotesize{X}}_{a}$ and $\mathcal{F}_{a}$ represent the Wav2vec2 encoder. For $N$ waveforms in a batch, the final acoustic representations are calculated as follows:

h^{(i)}_{a}=\text{Pooling}(\mathcal{F}_{a}(\textbf{\footnotesize{X}}_{a}^{(i)})),~{}~{}i\in\{1,2,...,N\}

(1)

where $h^{(i)}_{a}\in\mathbb{R}^{d}$ is the $i$ -th audio representation of dimensionality ${d}$ . A $Pooling$ layer ( $\mathit{i.e.,}$ mean-pooling) is applied to aggregate the frame-level features into an utterance-level representation.

We choose TinyBERT [12] as the text backbone network, a Transformer model trained with attention-based distillation and hidden states-based distillation. Let the text be $\textbf{\footnotesize{X}}_{t}$ and $\mathcal{F}_{t}$ represent the TinyBERT encoder. Similar to the audio encoding process, we can obtain the linguistic representation $h^{(i)}_{t}$ :

h^{(i)}_{t}=\text{Pooling}(\mathcal{F}_{t}(\textbf{\footnotesize{X}}_{t}^{(i)})),~{}~{}i\in\{1,2,...,N\}

(2)

2.2 Contrastive Learning

Given a representation pair ( $h^{(i)}_{a}$ , $h^{(i)}_{t}$ ), following CLIP [13], we use InfoNCE loss to measure dependencies between the audio and text modality. Specifically, we project $h^{(i)}_{a}$ and $h^{(i)}_{t}$ onto the latent embedding space. Then we maximize the similarity between the pair of audio and text representation ( $\mathit{i.e.,}$ positive pair) while minimizing the similarity between the negative pairs as follows:

	$\displaystyle\textbf{z}^{(i)}_{a}=\phi_{a}(h^{(i)}_{a}),~{}~{}\textbf{z}^{(i)}_{t}=\phi_{t}(h^{(i)}_{t})$		(3)
	$\displaystyle\mathcal{L}_{a}=-\log\frac{\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{a},\textbf{z}^{(i)}_{t})/\tau)}{\sum_{\textbf{z}^{(j)}_{t}\in\mathcal{B}}\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{a},\textbf{z}^{(j)}_{t})/\tau)}$		(4)
	$\displaystyle\mathcal{L}_{t}=-\log\frac{\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{t},\textbf{z}^{(i)}_{a})/\tau)}{\sum_{\textbf{z}^{(j)}_{a}\in\mathcal{B}}\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{t},\textbf{z}^{(j)}_{a})/\tau)}$		(5)

where $\phi_{a}$ and $\phi_{t}$ are learned fully-connected projection functions, $\textbf{z}^{(i)}_{a}$ and $\textbf{z}^{(i)}_{t}$ are hidden representations of the $d$ dimension, $sim()$ is a similarity function ( $\mathit{e.g.,}$ dot product). Furthermore, $\mathcal{B}=\{\textbf{z}^{(1)}_{\beta},\textbf{z}^{(2)}_{\beta},...,\textbf{z}^{(N)}_{\beta}\}$ ( $\beta\in\{a,t\}$ ) is a set of hidden representations, which contains a positive sample $\textbf{z}^{(i)}_{\beta}$ and N-1 negative samples ( $\mathcal{B}-\textbf{z}^{(i)}_{\beta}$ ).

2.3 Language-Audio Matching

Contrasting learning objective often requires a large batch size, which in turn increases a tremendous computational resource cost. To cut down the heavy resource dependency, we introduce a language-audio matching objective with hard negative sampling [16] to allow us to conduct efficient multimodal representation alignment with limited resources. Specifically, for a given fixed multimodal representation pairs ( $\textbf{z}^{(i)}_{a}$ , $\textbf{z}^{(i)}_{t}$ ), a set of hard negative examples { ( $\textbf{z}^{(i)}_{a}$ , $\textbf{z}^{(k)}_{t}$ ), ( $\textbf{z}^{(k)}_{a}$ , $\textbf{z}^{(i)}_{t}$ )}, are sampled from the top-K most similar representations in the training batch. In other words, we use carefully selected hard negative samples as an additional negative sample for contrastive learning as follows:

	$\displaystyle\mathcal{L}_{a^{*}}=-\log\frac{\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{a},\textbf{z}^{(i)}_{t})/\tau)}{\sum_{\textbf{z}^{(\bar{j})}_{t}\in\{\textbf{z}^{(i)}_{t},~{}\textbf{z}^{(k)}_{t}\}}\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{a},\textbf{z}^{(\bar{j})}_{t})/\tau)}$		(6)
	$\displaystyle\mathcal{L}_{t^{*}}=-\log\frac{\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{t},\textbf{z}^{(i)}_{a})/\tau)}{\sum_{\textbf{z}^{(\bar{j})}_{a}\in\{\textbf{z}^{(i)}_{a},~{}\textbf{z}^{(k)}_{a}\}}\mathrm{exp}(\mathrm{sim}(\textbf{z}^{(i)}_{t},\textbf{z}^{(\bar{j})}_{a})/\tau)}$		(7)

2.4 DSCLAP objective

Incorporating the loss on the contrastive learning and the cross-modal matching introduced above, we estimate the parameters of the pre-train model $\theta$ by minimizing the following objective:

\min_{\theta}\lambda(\mathcal{L}_{a}+\mathcal{L}_{t})+\gamma(\mathcal{L}_{a^{*}}+\mathcal{L}_{t^{*}})

(8)

where $\lambda$ and $\gamma$ are hyperparameters that determine the contribution of each regularization component. For all the experiments, we set $\lambda$ and $\gamma$ as $0.5$ , which we searched through cross-validation.

3 Experiments

3.1 Datasets

3.1.1 Pre-training

We pre-train our DSCLAP in the in-vehicle domain. Specifically, we collect a total of 12,107 hours of training corpus, which contains more than 20M unlabeled raw audio from more than 250K vehicles via an in-vehicle IVA. We convert these audios to text using the iFLYTEK cloud speech service¹¹1www.xfyun.cn/services/lfasr, which provides the current state-of-the-art ASR system in China, achieving an approximate 18.7% character error rate in the training corpus. Note that although the language-audio pairs used for pre-training may not be perfect, DSCLAP can still learn appropriate representations from positive and negative examples through comparative learning. Moreover, fine-tuning on downstream tasks can further correct the effects of ASR errors. We will discuss this in Section 4.

3.1.2 Downstream Tasks

We evaluate DSCLAP on two downstream tasks, corresponding to the two hot research fields: 1) Multimodal Device-directed Speech Detection (MDSD), which allows IVAs to continuously listen to the user within a predefined period without repeatedly detecting a wake-up-word or trigger phrase in human-computer interaction [6, 17]. 2) Multimodal Conversational Intent Classification (MCIC), which aims to jointly understand user intent through information from both text and audio modalities [7, 8]. Both MDSD and MCIC are classification tasks, where MDSD includes two categories of device-directed or non-device-directed, and MCIC includes 15 intents, such as playing music, navigating, calling, and chatting. The details of training data, validation data, and test data splitting are shown in Table 1.

Task		Train	Valid	Test	Count
MDSD	ASR-only	10,000	5,000	4,800	19,000
MDSD	ASR&Manual	2356	337	674	3,367
MCIC		6,400	2,000	1,800	10,200

Table 1: Statistical information and data partition of datasets used in this paper.

3.2 Implementation Details

We use AdamW optimizer for pre-training and the learning rate is initialized to 2e-5. The the temperature $\tau$ of contrastive loss is initialized to $log(1\//0.07)$ . The maximum sequence length is set to 16 and 80,000 for texts and audio waveforms respectively. Our model is implemented in PyTorch Lightning and transformers libraries. We train it for 20 epochs with mixed precision, on 8 NVIDIA RTX 3090 GPUs with a batch size of 64 per GPU. The whole training process takes 7.5 days to complete. Notably, for a fair comparison, we used five random seeds (i.e. 1, 12, 123, 1234, and 12345) to trigger experiments in both of the downstream tasks and report the average performance of the five experiments for each task.

4 RESULTS AND ANALYSIS

4.1 Quantitative Results

We combined several recent state-of-the-art unimodal pre-train models to benchmark their performance on domain-specific downstream tasks and report the results of the two best-performing models: (Whisper [10], BERT [12]) and (Wav2vec2 [9], BERT [12]), where the better-performing (Wav2vec2, BERT) is set as a baseline backbone network (annotated by Base). Additionally, we also introduce two novel competitive optimization methods (Rdrop [18] and OGM-GE [19]) on the Base model.

The comparative results are presented in Table 2. In both datasets, DSCLAP achieves the best performance and surpasses the variants across all metrics (MDSD and MCIC combined). This confirms that our method can be better suited to domain-specific multimodal tasks. We observe that the proposed model significantly reduced the false rejection rate (FRR) in the MDSD task from 5.35% to 2.77%, which is of great benefit to IVAs because it will enable IVA to respond to more user requests without wake-up steps.

Model	MDSD		MCIC
Model	ACC	FRR	ACC	F1
(Whisper [10], BERT [12])	91.25	10.82	87.57	86.98
(Wav2vec2 [9], BERT [12]) ${}_{\textbf{Base}}$	92.48	6.11	87.93	87.52
Base-OGM-GE [19]	92.51	5.93	88.90	88.47
Base-RDrop [18]	93.57	5.35	88.54	88.01
DSCLAP	95.06	2.77	89.98	89.76

Table 2: Experimental results on MDSD ASR-only dataset and MCIC dataset. The best results are highlighted in bold.

4.2 Ablation Study

4.2.1. Do we need a larger training data size for finetuning? We compare the impact of different training data sizes $M\in\{1\text{\footnotesize{K}},2\text{\footnotesize{K}},...,10\text{\footnotesize{K}}\}$ on the Base model and our proposed model on the MDSD task, results shown in Fig. 2. We observe that the accuracy curves of both models showed an increasing trend with the increase in training data. However, with the decrease in data size, the accuracy of the Base model declined dramatically. In contrast, our model is still stable at an accuracy higher than 93.33% with $1\text{\footnotesize{K}}$ training data, suggesting that DSCLAP can help the Base model better adapt to few-shot data scenarios and provide a possible solution for cold start problems with the domain-specific model.

4.2.2. Is pre-training with imperfect language-audio pairs feasible? We also experiment on different text inputs under the same audio source, with results shown in Table 3. We observe that our DSCLAP outperforms the Base model by 2.26% on manually transcribed data. This demonstrates that although there is some noise (ASR transcription error) in the pre-training data, DSCLAP can effectively learn the alignment information between the two modalities by contrastive learning. the model improves from 2.26% to 5.37% when comparing the ASR transcript pairs to manual transcript pairs fine-tuning results. This is an encouraging result that enables us to better apply it to IVAs since the text data of IVAs are always transformed by real-time ASR systems. Additionally, we found that DSCLAP did not perform as well as ASR transcriptions on manual transcriptions. This may be caused by the inconsistency between the downstream task and the pre-training task input. Because DSCLAP is trained on a large amount of ASR transcription pairs, it may be more suitable for downstream tasks that use ASR transcription pairs as input.

Source	Base	DSCLAP	$\bigtriangledown$
ASR	89.82	95.19	$\uparrow$ 5.37
Manual	91.98	94.24	$\uparrow$ 2.26

Table 3: Comparison of performance of models in ASR transcript pairs and manual transcript pairs on ASR & Manual dataset. The col with

\bigtriangledown

means the improvements of our model compared to the Base model in accuracy.

4.2.3. Base encoders vs. DSCLAP encoders. To investigate the difference in generalization between the Base and DSCLAP on encoders, we compare the two models with variants that freeze portions of the parameters. Results are shown in Table 4. Overall, our DSCLAP achieves better performance than the base model on all variants, showing the benefits of domain-specific pre-training. For DSCLAP, we observe that training only with the text encoder parameters results in the worst performance, contrary to previous works [15]. This is possibly because the text used for fine-tuning is noisy and imperfect ASR transcriptions, instead of manually transcripts in previous work.

$\mathcal{F}_{a}$	$\mathcal{F}_{t}$	Base	DSCLAP	$\bigtriangledown$
✗	✗	85.08	93.66	$\uparrow$ 8.58
✗	✔	88.63	90.67	$\uparrow$ 1.94
✔	✗	91.17	94.95	$\uparrow$ 3.78
✔	✔	92.48	95.06	$\uparrow$ 2.58

Table 4: Base encoders vs. DSCLAP encoders. ✔ indicates the parameters of the corresponding encoder are trainable.

5 CONCLUSION

In this work, we aim to address the domain-specific multimodal pre-train problem that lacks paired training data. To this end, we propose DSCLAP, which uses an ASR system to automatically generate text from raw speech signals and builds a multimodal pre-train model with this language-audio pair. Compared with previous works, DSCLAP greatly reduces the cost of data collection and cleaning, allowing economical end-to-end learning from raw audio and ASR transcriptions. By pre-training on 12,107 hours of in-vehicle domain audio data, DSCLAP provides more efficient domain-specific modal representations, helping the model achieve state-of-the-art results on downstream tasks.

DSCLAP demonstrates that efficient multimodal pre-training can be accomplished using only raw audio signals. We believe that the performance of DSCLAP on downstream tasks could lead to many exciting possibilities for its future applications in IVA. We hope that the training strategy of DSCLAP could encourage researchers for further explorations on different domains.

References

[1] Giuseppe Riccardi, “Towards healthcare personal agents,” in RFMIR, 2014.
[2] Ethan Selfridge and Michael Johnston, “Interact: Tightly-coupling multimodal dialog with an interactive virtual assistant,” ICMI, 2015.
[3] Che-Wei Huang, Roland Maas, Sri Harish Reddy Mallidi, and Björn Hoffmeister, “A study for improving device-directed speech detection toward frictionless human-machine interaction,” in INTERSPEECH, 2019.
[4] Sri Harish Reddy Mallidi, Roland Maas, Kyle Goehner, Ariya Rastrow, Spyridon Matsoukas, and Björn Hoffmeister, “Device-directed utterance detection,” in INTERSPEECH, 2018.
[5] Atta Norouzian, Bogdan Mazoure, Dermot Connolly, and Daniel Willett, “Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed,” ICASSP, pp. 7310–7314, 2019.
[6] Kellen Gillespie, Ioannis C Konstantakopoulos, Xingzhi Guo, Vishal Thanvantri Vasudevan, and Abhinav Sethy, “Improving device directedness classification of utterances with semantic lexical features,” in ICASSP, 2020, pp. 7859–7863.
[7] Shaozu Yuan, Xin Shen, Yuming Zhao, Hang Liu, Zhiling Yan, Ruixue Liu, and Meng Chen, “Mcic: Multimodal conversational intent classification for e-commerce customer service,” in ICNLPCC, 2022, pp. 749–761.
[8] Hanlei Zhang, Hua Xu, Xin Wang, Qianrui Zhou, Shaojie Zhao, and Jiayan Teng, “Mintrec: A new dataset for multimodal intent recognition,” ACMMM, 2022.
[9] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NIPS, 2020.
[10] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” Tech. Rep., Technical report, OpenAI, 2022. URL https://cdn. openai. com/papers/whisper. pdf, 2022.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
[12] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu, “Tinybert: Distilling bert for natural language understanding,” 2020.
[13] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[14] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in ICML, 2021.
[15] Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang, “Clap: Learning audio concepts from natural language supervision,” arXiv preprint arXiv:2206.04769, 2022.
[16] Yan Zeng, Wangchunshu Zhou, Ao Luo, and Xinsong Zhang, “Cross-view language modeling: Towards unified cross-lingual cross-modal pre-training,” ArXiv, vol. abs/2206.00621, 2022.
[17] Vilayphone Vilaysouk, Amr Nour-Eldin, and Dermot Connolly, “Improving identification of system-directed speech utterances by deep learning of asr-based word embeddings and confidence metrics,” in ICASSP, 2021, pp. 6379–6382.
[18] Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, Tie-Yan Liu, et al., “R-drop: Regularized dropout for neural networks,” NIPS, vol. 34, pp. 10890–10905, 2021.
[19] Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu, “Balanced multimodal learning via on-the-fly gradient modulation,” in CVPR, 2022, pp. 8238–8247.