
Silence is Sweeter Than Speech:
Self-Supervised Model Using Silence to Store Speaker Information

Abstract

Self-Supervised Learning (SSL) has made great strides recently. SSL speech models achieve decent performance on a wide range of downstream tasks, suggesting that they extract different aspects of information from speech. However, how SSL models store various kinds of information in hidden representations without interference is still poorly understood. Taking the recently successful SSL model HuBERT as an example, we explore how the SSL model processes and stores speaker information in its representations. We find that HuBERT stores speaker information in representations whose positions correspond to silences in a waveform. Several pieces of evidence support this. (1) Utterances with more silence in the waveform yield better Speaker Identification (SID) accuracy. (2) When the whole utterance is used for SID, the silent part always contributes more to the task. (3) When only the representation of one part of the utterance is used for SID, the silent part achieves higher accuracy than the other parts. Our findings not only contribute to a better understanding of SSL models but also improve performance: by simply adding silence to the original waveform, HuBERT improves its SID accuracy by nearly 2%.

Index Terms: Self-Supervised Learning, Speaker Information

1 INTRODUCTION

Self-Supervised Learning (SSL), which learns from unlabeled data, has achieved many milestones in different fields. In speech, several outstanding SSL models have been proposed in recent years [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. We can leverage representations from these SSL models with a simple downstream model and perform well on different speech tasks [15]. This shows that the SSL models have universal characteristics: they integrate different kinds of information, such as content, semantic, and speaker information, into low-dimensional representations.

Despite the achievements of SSL models in various speech tasks, there is only limited research analyzing SSL models and their representations. Previous works focus mainly on the performance of different layers or types of models [16, 17, 18, 19]. There is currently no research on how the SSL model stores such complex speech information in its representations. In this work, we choose speaker information as the entry point to demystify the SSL model. Instead of layer-wise analysis as in previous works, we propose to analyze speaker information through positions within the waveform. How SSL models store speech information in different parts of an utterance has not been studied well. We hypothesize that the SSL model may keep different types of information in different positions, helping it better disentangle and store various speech information.

We adopt the Speaker Identification (SID) task to measure the speaker information in the representations. We first find that the SSL model tends to store speaker information in the last fragment of an utterance, which contains more silence. We then further analyze the relationship between the percentage of silence and SID accuracy. To our surprise, the amount of silence in an utterance is highly correlated with SID accuracy, which suggests that silence has some relationship with speaker information in the SSL model. To gather more evidence, we add silence to utterances, split them into fragments, and use only one fragment at a time to train the SID task. We find that the fragment corresponding to the silent part achieves the highest score among all fragments. Moreover, we obtain better SID accuracy by simply adding silence during training. All the evidence indicates that silence helps the SSL model store speaker information. To the best of our knowledge, this is the first work that uses positional information to analyze the representation characteristics of speech SSL models and to explore their storage mechanism.

Figure 1: (a) Experiment structure in Section 3 and Section 4.2. The dotted square represents the silence fragment, which is added only in Section 4.2. The input of the downstream model is the weighted sum of the HuBERT representations. (b) Experiment structure in Section 4.3. The downstream model takes only one fragment as input at a time.

2 RELATED WORK

Following the success of SSL models in recent years, more and more papers have tried to analyze the key to their success. However, current analyses of SSL models mainly focus on the text domain [20, 21, 22, 23, 24, 25, 26]. For example, [27] analyzes how attention maps in BERT process linguistic knowledge, and [28] focuses on how the SSL model learns linguistic information during training. Few papers study SSL models in the speech domain.

The limited work on analyzing speech SSL models follows two research directions. The first compares speech SSL models by their training criteria: [16] shows that the training criterion is strongly related to downstream task performance. The second studies SSL models from the perspective of content: for example, [17] uses Canonical Correlation Analysis (CCA) to investigate which information is encoded by each layer of the SSL model. Inspired by these prior works, we want to explore the logic behind the SSL model. Unlike previous works, however, we focus on how information is stored in the representations of the SSL model and how the input waveform interacts with the SSL model.

Figure 2: Norm weights of the SID task for each fragment. The $l$-th line represents using the representations extracted from the $l$-th layer of HuBERT-Base to train the SID task. There are 12 lines corresponding to the 12 layers of HuBERT-Base.
Figure 3: Norm weights of the SID task for the shuffle experiment. We shuffle the waveform fragments before feeding them to HuBERT-Base. The x-axis indices correspond to the original fragment numbers in Figure 2.

3 WHERE IS THE SPEAKER INFORMATION

Some previous works have shown that SSL models tend to store speaker information in the middle layers [18, 19]. However, how the information is stored in different positions within an utterance remains unexplored. In this section, we first analyze the importance of representations from different positions for the SID task.

The setup of the SID task here is modified from SUPERB [15]. We follow the setting in [15] to split the VoxCeleb dataset [29] into training and testing sets. We use HuBERT as the upstream model. Unlike the experiments in the SUPERB benchmark, to analyze each layer, we use a single layer representation instead of the weighted sum of all layers.
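For concreteness, the sketch below shows one way to extract a single-layer HuBERT-Base representation. It relies on the torchaudio HuBERT pipeline rather than the paper's SUPERB/s3prl setup, so the exact interface and model checkpoint are assumptions on our part.

```python
# Minimal sketch of single-layer feature extraction (assumes the torchaudio
# HuBERT pipeline; the paper's own SUPERB-based pipeline may differ).
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def layer_representation(waveform: torch.Tensor, layer: int) -> torch.Tensor:
    """Return the frame-level representation of one transformer layer.

    waveform: (batch, samples) sampled at bundle.sample_rate (16 kHz).
    """
    with torch.no_grad():
        # extract_features returns one tensor per transformer layer
        features, _ = model.extract_features(waveform)
    return features[layer]  # (batch, frames, 768) for HuBERT-Base
```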

In the original SUPERB setting of the SID task, the downstream model consists of a pooling layer and a linear layer. Here we modify the pooling layer to further investigate whether SSL models store speaker information in all representations equally. The modified SID process is shown in Figure 1(a). We first segment the sequence of representations from a specific layer of the upstream model into 10 equal fragments. We apply mean pooling over the frame-level representations in each fragment to obtain a fragment representation $f_i$ ($i = 1$ to $10$). We then assign a learnable weight $w_i$ to the $i$-th fragment, which is learned jointly with the downstream model. Each fragment representation is multiplied by its learnable weight, and the weighted fragment representations are summed to obtain a single representation $f$ for the whole utterance:

$f = \sum_{i=1}^{10} w_i f_i$   (1)

Such a design allows us to observe whether representations at different positions of an utterance contribute differently to SID. If a specific fragment has a larger weight than the others, we can hypothesize that it is more critical to the SID task.
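A minimal PyTorch sketch of this modified downstream model is given below; the class and argument names are ours, not the paper's, and the paper's actual implementation may differ in details such as fragment boundary handling.

```python
import torch
import torch.nn as nn

class FragmentWeightedSID(nn.Module):
    """Mean-pool 10 equal fragments, combine them with learnable weights w_i
    as in Eq. (1), then classify the speaker with a single linear layer."""

    def __init__(self, dim: int, n_speakers: int, n_fragments: int = 10):
        super().__init__()
        self.n_fragments = n_fragments
        self.w = nn.Parameter(torch.ones(n_fragments))   # w_1 ... w_10
        self.classifier = nn.Linear(dim, n_speakers)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        # reps: (batch, frames, dim) frame-level features from one HuBERT layer
        frags = torch.chunk(reps, self.n_fragments, dim=1)
        f_i = torch.stack([frag.mean(dim=1) for frag in frags], dim=1)  # (batch, 10, dim)
        f = (self.w.view(1, -1, 1) * f_i).sum(dim=1)                    # Eq. (1)
        return self.classifier(f)
```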

However, considering only the absolute value of a weight does not always indicate that the corresponding fragment is critical. If a fragment's representation has small values, its weight may become large simply to compensate for them. Therefore, we consider the weights and the representations together and introduce the norm weight $\bar{w}_i$ to indicate the importance of a fragment. The norm weight $\bar{w}_i$ is the learnable weight $w_i$ of a fragment multiplied by the L2 norm of the fragment's representation, $\|f_i\|$:

$\bar{w}_i = w_i \|f_i\|$   (2)

Figure 2 shows the average of the norm weights $\bar{w}_i$ of the fragments across the testing set. For a fair comparison between utterances, we normalize the norm weights of each utterance to sum to 1 before averaging. We find that the last fragment of the waveform always contributes the most among all the fragments. Investigating the last fragment, we find that it contains a higher portion of silence than the other fragments.
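A small sketch of how Eq. (2) and the per-utterance normalization can be computed; the function name is hypothetical.

```python
import torch

def norm_weights(w: torch.Tensor, f_i: torch.Tensor) -> torch.Tensor:
    """Norm weights for one utterance, normalized to sum to 1.

    w   : (n_fragments,)      learned fragment weights
    f_i : (n_fragments, dim)  mean-pooled fragment representations
    """
    bar_w = w * f_i.norm(dim=-1)   # Eq. (2): w_i * ||f_i||_2
    return bar_w / bar_w.sum()     # normalize so the utterance's weights sum to 1
```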

To further verify that the observation is caused by the content rather than the order of the fragments, we randomly permute the order of the fragments in each testing utterance and evaluate them with the same trained downstream model; the same permutation is applied to all utterances. Figure 3 shows the results. After the permutation, the original last fragment (fragment 10) is moved to the middle of the utterance but still has the highest norm weight. The other permuted fragments also keep norm weights close to their original values, showing that it is the content of a fragment, not its position, that determines its norm weight. Based on this finding, we hypothesize that silent fragments may be more related to the storage of speaker information.
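The perturbation can be reproduced with a sketch like the following, assuming a single fixed permutation is applied to every test waveform before it is fed to HuBERT.

```python
import torch

def shuffle_fragments(waveform: torch.Tensor, permutation: list,
                      n_fragments: int = 10) -> torch.Tensor:
    """Split a waveform into equal-length fragments and reorder them.

    waveform    : (samples,) mono waveform
    permutation : list of fragment indices, e.g. [3, 9, 0, ...]
    """
    frags = list(torch.chunk(waveform, n_fragments, dim=-1))
    return torch.cat([frags[i] for i in permutation], dim=-1)
```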

Figure 4: Relationship between the percentage of silence in the waveform and the accuracy of the SID task.
Figure 5: (a) Norm weights of the SID task with added silence. Three SSL models are used in the experiment: HuBERT-Base, HuBERT-1Iter, and HuBERT-Large. (b) Norm weights of the ASV task with added silence. On the x-axis, "s" denotes the silent fragment. Red lines: silence added at the front; blue lines: silence added in the middle; green lines: silence added at the end.
Table 1: SID accuracy using different fragments of HuBERT representations. The first column is the fragment number, including the silence fragment (S). The second and third columns correspond to silence added at the front and end of the input waveform, respectively. The fourth column is the waveform without added silence; the percentage in parentheses is the difference in accuracy between No Silence and Silence Front. We use McNemar's test to check whether each pair of Silence Front and No Silence differs significantly; $\dagger$ denotes a significant difference.
$F_i$ Silence Front Silence End No Silence
S 0.499 X X
1 0.337 0.382 0.418(+18%)
2 0.364 0.364 0.418(+12%)
3 0.374 0.415 0.440(+14%)
4 0.376 0.425 0.425(+11%)
5 0.378 0.425 0.434(+12%)
6 0.373 0.428 0.430(+13%)
7 0.380 0.431 0.439(+13%)
8 0.400 0.433 0.423(+5%)
9 0.410 0.430 0.429(+4%)
10 0.416 0.354 0.423(+1%)
S X 0.536 X
Table 2: SID accuracy of different SSL models and silence lengths. The first column is the place where silence was added to the original waveform; the second column is the silence length relative to the original waveform. The other columns are the performance of three different models. We use McNemar's test to check whether each pair of Baseline and another setting of a given model differs significantly; $\dagger$ denotes a significant difference.
Silence position Silence length HuBERT-Base HuBERT-Large wav2vec2
Baseline X 0.807 0.890 0.739
Front 1/5 0.803 0.874 0.735
Front 1/10 0.824 0.892 0.748
Front 1/20 0.818 0.888 0.744
End 1/5 0.801 0.878 0.724
End 1/10 0.816 0.884 0.747
End 1/20 0.813 0.883 0.746

4 WHERE CAN WE FIND THE SPEAKER? SILENCE

At the end of Section 3, we hypothesized that silence correlates with speaker information. In this section, we conduct more experiments to verify this relationship. In Section 4.1, we show the correlation between SID performance and the portion of silence in an utterance. Then, in Sections 4.2 and 4.3, we insert silence into waveforms and show the importance of the silent part for the SID task.

4.1 Amount of Silence vs SID Task Performance

After investigating the property of the last fragment and finding that it contains a lot of silence, we want to know whether the amount of silence in an utterance affects SID accuracy. If the amount of silence and SID accuracy are positively correlated, we can assume that silence has some relationship with speaker information.

In this experiment, we use the VoxCeleb test set as the measurement object. We use the Librosa toolkit to measure, for each utterance, the ratio of the duration below a 10 dB threshold (treated as silence) to the duration of the original speech. The result is shown in Figure 4. We find that if the silence ratio is below a threshold of about 5%, performance drops by about 30% to 50% compared with the other cases. From this experiment, we observe that silence strongly correlates with SID accuracy, which means that the silent part of the waveform is indeed related to speaker information.
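A sketch of the silence-ratio measurement, assuming librosa's top_db-based splitting matches the 10 dB threshold described above; the exact parameters used in the paper are not specified.

```python
import librosa
import numpy as np

def silence_ratio(path: str, top_db: float = 10.0) -> float:
    """Fraction of an utterance whose energy falls more than `top_db`
    below the peak, i.e. the part treated as silence."""
    y, _ = librosa.load(path, sr=16000)
    intervals = librosa.effects.split(y, top_db=top_db)  # non-silent spans
    voiced = np.sum(intervals[:, 1] - intervals[:, 0])
    return 1.0 - voiced / len(y)
```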

4.2 Silence is Important For SSL Model to Store Speaker Information

To measure the importance of silence, we add a silence fragment to the waveform whose length is 1/10 of the original waveform. The silence fragment is inserted at the front, middle, or end of the original waveform. Then, following the analysis method in Section 3, we segment the representation sequence into 11 fragments (10 corresponding to the original waveform and 1 corresponding to the inserted silence) and perform the same analysis as in Section 3. By observing the norm weights, we hope to confirm whether the silence fragment contributes more to the SID task. The dataset is again the VoxCeleb test set, as in Section 3. Unlike the previous experiments, to confirm that our results are general, we use three models in this section: HuBERT-Base, HuBERT-1Iter (HuBERT trained for only one iteration), and HuBERT-Large. Furthermore, for these three models we use the weighted sum of the outputs of all layers as the representation.
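The waveform modification itself is simple; a minimal sketch follows, where the function name and exact boundary handling are our assumptions.

```python
import numpy as np

def add_silence(y: np.ndarray, position: str = "front",
                ratio: float = 0.1) -> np.ndarray:
    """Insert zeros of length `ratio` * len(y) at the front, middle, or end."""
    sil = np.zeros(int(len(y) * ratio), dtype=y.dtype)
    if position == "front":
        return np.concatenate([sil, y])
    if position == "middle":
        half = len(y) // 2
        return np.concatenate([y[:half], sil, y[half:]])
    return np.concatenate([y, sil])  # "end"
```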

The outcome is shown in Figure 5. We find that no matter whether we insert silence at the front, middle, or end of the original waveform, the norm weight of the silence fragment is always the largest, which means that the downstream model mainly uses this fragment to classify speakers. We perform the same analysis on Automatic Speaker Verification (ASV). The setup of the ASV task is the same as in SUPERB, except that the pooling layer is modified as in Section 3. As Figure 5 shows, the same phenomenon occurs in the ASV probing task. Figure 5 thus indicates that the representations corresponding to silence fragments store more speaker information than the other representations.

4.3 Is Silence Really Important for SSL Models? Yes

In Section 4.2, the norm weights show that the silence fragments are essential for speaker information. This section provides more experiments to verify that the silence fragments store more speaker information than the others. As in the previous experiments, we separate a representation sequence into multiple fragments; unlike the earlier experiments, however, we pick only one of the fragments as the input of the SID downstream model. We use HuBERT as the upstream model and the weighted sum of the outputs of all layers as the representation. The process of the experiments in this subsection is shown in Figure 1(b).
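The single-fragment probe of Figure 1(b) can be sketched as follows; this is a hypothetical PyTorch implementation in which only the chosen fragment reaches the linear classifier.

```python
import torch
import torch.nn as nn

class SingleFragmentSID(nn.Module):
    """Classify the speaker from the mean-pooled representation of one
    fragment of the (weighted-sum) HuBERT representation sequence."""

    def __init__(self, dim: int, n_speakers: int,
                 fragment_idx: int, n_fragments: int = 11):
        super().__init__()
        self.fragment_idx = fragment_idx
        self.n_fragments = n_fragments
        self.classifier = nn.Linear(dim, n_speakers)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        # reps: (batch, frames, dim)
        frags = torch.chunk(reps, self.n_fragments, dim=1)
        return self.classifier(frags[self.fragment_idx].mean(dim=1))
```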

The results are shown in Table 1. In the columns labeled "Silence Front" and "Silence End", we add silence at the front or end of the original waveform, respectively. We then separate the representations into 11 fragments (10 corresponding to the original waveform and 1 corresponding to the inserted silence). The silence fragment outperforms the other fragments by about 10% accuracy, whether silence is added at the front or at the end. This means that the silence fragment contains more speaker information, so its SID performance is better than that of the other fragments.

In the column labeled "No Silence", we do not add silence to the original waveform. After obtaining the representations, we separate the representation sequence into 10 fragments and again pick one of them to train the downstream model. The accuracy of the non-silence fragments in the waveform with silence added at the beginning is lower than the accuracy of the corresponding fragments in the waveform without inserted silence. This suggests that the silence fragment aggregates speaker information from the other fragments. Together, these two experiments show that the silence fragments both contain speaker information and aggregate it from other fragments. The observations suggest that adding silence could serve as an alternative way of disentangling speaker and content information; we leave this idea as future work.

5 Silence Can Help to Improve the SID Task

The previous experiments show that silent parts of waveforms help the SSL model store speaker information. In this section, we use this finding to improve SID performance. Inspired by the results of Section 4.3, we directly add silence to all the data in VoxCeleb. Silence is added at the front or end, with a length equal to 1/5, 1/10, or 1/20 of the original waveform. We use this modified dataset to train the SID task.
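Reusing the add_silence() sketch from Section 4.2, the augmentation amounts to something like the following, where each (position, ratio) pair defines one training setting of Table 2; file handling and training code are omitted and the names are hypothetical.

```python
# Hypothetical loop over the augmentation settings of Table 2.
for position in ("front", "end"):
    for ratio in (1/5, 1/10, 1/20):
        y_aug = add_silence(y, position=position, ratio=ratio)
        # ... train one SID downstream model on waveforms augmented this way
```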

The outcome is shown in Table 2. We evaluate three different upstream models: HuBERT, HuBERT-Large, and wav2vec2. We use the layer-wise weighted-sum strategy in this section as in SUPERB, so the SID results here are much better than in the previous experiments and comparable with the SUPERB benchmark. With this naive strategy of modifying the original dataset, SID accuracy increases by about 2% for the HuBERT upstream model and by about 0.2% for HuBERT-Large. The results show that adding silence to waveforms efficiently helps the model learn speaker information. Furthermore, we find that adding too much or too little silence is not beneficial: all three models perform best when the silence length is 1/10. This may be because the utterance itself already contains some silence, so adding too much silence disperses the speaker information. HuBERT-Large may be unable to make significant progress because its original accuracy is already about 90%, but the fact that adding silence still helps HuBERT-Large on the SID task indicates that silence helps store speaker information.

6 CONCLUSION

In this work, we use the SID probing task and the HuBERT model to understand how the SSL model processes and stores speaker information from the waveform. We adopt an upstream-downstream structure for the probing tasks: HuBERT is fixed as the upstream model, and a simple linear layer serves as the downstream model. Our experiments show that speaker information is stored in the silence fragments. Moreover, when the ratio of silence in the waveform is lower than 5%, SID accuracy decreases by about 30% to 50%. Last but not least, adding some silence to the original waveform increases SID accuracy by up to 2% without fine-tuning the upstream model. These findings give more insight into how SSL models process speaker information and may inspire new approaches to speaker-related tasks. In future work, we will investigate whether other types of information are also stored in specific positions of the representations, for example, whether content information is mainly stored in the voiced fragments. We will also examine more SSL models with different structures and training criteria.

7 ACKNOWLEDGEMENTS

We are grateful to Abdelrahman Mohamed for his comments and discussions of this paper.

References

  • [1] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” CoRR, vol. abs/2106.07447, 2021. [Online]. Available: https://arxiv.org/abs/2106.07447
  • [2] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” CoRR, vol. abs/2006.11477, 2020. [Online]. Available: https://arxiv.org/abs/2006.11477
  • [3] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” CoRR, vol. abs/1910.05453, 2019. [Online]. Available: http://arxiv.org/abs/1910.05453
  • [4] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” CoRR, vol. abs/1904.05862, 2019. [Online]. Available: http://arxiv.org/abs/1904.05862
  • [5] A. T. Liu, S.-W. Yang, P.-H. Chi, P. chun Hsu, and H. yi Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders.” CoRR, vol. abs/1910.12638, 2019. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9054458
  • [6] S. Ling and Y. Liu, “Decoar 2.0: Deep contextualized acoustic representations with vector quantization,” ArXiv, vol. abs/2012.06659, 2020.
  • [7] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” CoRR, vol. abs/1807.03748, 2018. [Online]. Available: http://arxiv.org/abs/1807.03748
  • [8] Y. Chung, W. Hsu, H. Tang, and J. R. Glass, “An unsupervised autoregressive model for speech representation learning,” CoRR, vol. abs/1904.03240, 2019. [Online]. Available: http://arxiv.org/abs/1904.03240
  • [9] Y.-A. Chung, H. Tang, and J. Glass, “Vector-quantized autoregressive predictive coding,” arXiv preprint arXiv:2005.08392, 2020.
  • [10] P.-H. Chi, P.-H. Chung, T.-H. Wu, C.-C. Hsieh, Y.-H. Chen, S.-W. Li, and H.-y. Lee, “Audio albert: A lite bert for self-supervised learning of audio representation,” in 2021 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2021, pp. 344–350.
  • [11] A. T. Liu, S.-W. Li, and H.-y. Lee, “Tera: Self-supervised learning of transformer encoder representation for speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351–2366, 2021.
  • [12] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, “Deep contextualized acoustic representations for semi-supervised speech recognition,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 6429–6433.
  • [13] H. Chang, S. Yang, and H. Lee, “Distilhubert: Speech representation learning by layer-wise distillation of hidden-unit BERT,” CoRR, vol. abs/2110.01900, 2021. [Online]. Available: https://arxiv.org/abs/2110.01900
  • [14] A. Baevski, W. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “data2vec: A general framework for self-supervised learning in speech, vision and language,” CoRR, vol. abs/2202.03555, 2022. [Online]. Available: https://arxiv.org/abs/2202.03555
  • [15] S. Yang, P. Chi, Y. Chuang, C. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G. Lin, T. Huang, W. Tseng, K. Lee, D. Liu, Z. Huang, S. Dong, S. Li, S. Watanabe, A. Mohamed, and H. Lee, “SUPERB: speech processing universal performance benchmark,” CoRR, vol. abs/2105.01051, 2021. [Online]. Available: https://arxiv.org/abs/2105.01051
  • [16] Y.-A. Chung, Y. Belinkov, and J. Glass, “Similarity analysis of self-supervised speech representations,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3040–3044.
  • [17] A. Pasad, J. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” CoRR, vol. abs/2107.04734, 2021. [Online]. Available: https://arxiv.org/abs/2107.04734
  • [18] S. Chen, Y. Wu, C. Wang, Z. Chen, Z. Chen, S. Liu, J. Wu, Y. Qian, F. Wei, J. Li, and X. Yu, “Unispeech-sat: Universal speech representation learning with speaker aware pre-training,” CoRR, vol. abs/2110.05752, 2021. [Online]. Available: https://arxiv.org/abs/2110.05752
  • [19] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” arXiv preprint arXiv:2110.13900, 2021.
  • [20] J. Vig and Y. Belinkov, “Analyzing the structure of attention in a transformer language model,” CoRR, vol. abs/1906.04284, 2019. [Online]. Available: http://arxiv.org/abs/1906.04284
  • [21] Y. Lin, Y. C. Tan, and R. Frank, “Open sesame: Getting inside bert’s linguistic knowledge,” CoRR, vol. abs/1906.01698, 2019. [Online]. Available: http://arxiv.org/abs/1906.01698
  • [22] J. Vig, “A multiscale visualization of attention in the transformer model,” CoRR, vol. abs/1906.05714, 2019. [Online]. Available: http://arxiv.org/abs/1906.05714
  • [23] Y. Hao, L. Dong, F. Wei, and K. Xu, “Visualizing and understanding the effectiveness of BERT,” CoRR, vol. abs/1908.05620, 2019. [Online]. Available: http://arxiv.org/abs/1908.05620
  • [24] S. Vashishth, S. Upadhyay, G. S. Tomar, and M. Faruqui, “Attention interpretability across NLP tasks,” CoRR, vol. abs/1909.11218, 2019. [Online]. Available: http://arxiv.org/abs/1909.11218
  • [25] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky, “Revealing the dark secrets of BERT,” CoRR, vol. abs/1908.08593, 2019. [Online]. Available: http://arxiv.org/abs/1908.08593
  • [26] F. Dalvi, H. Sajjad, N. Durrani, and Y. Belinkov, “Exploiting redundancy in pre-trained language models for efficient transfer learning,” CoRR, vol. abs/2004.04010, 2020. [Online]. Available: https://arxiv.org/abs/2004.04010
  • [27] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, “What does BERT look at? an analysis of bert’s attention,” CoRR, vol. abs/1906.04341, 2019. [Online]. Available: http://arxiv.org/abs/1906.04341
  • [28] D. C. Chiang, S. Huang, and H. Lee, “Pretrained language model embryology: The birth of ALBERT,” CoRR, vol. abs/2010.02480, 2020. [Online]. Available: https://arxiv.org/abs/2010.02480
  • [29] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale speaker identification dataset,” CoRR, vol. abs/1706.08612, 2017. [Online]. Available: http://arxiv.org/abs/1706.08612