
Improved Vocal Effort Transfer Vector Estimation for Vocal Effort-Robust Speaker Verification

Abstract

Despite the maturity of modern speaker verification technology, its performance still significantly degrades when facing non-neutrally-phonated (e.g., shouted and whispered) speech. To address this issue, in this paper, we propose a new speaker embedding compensation method based on a minimum mean square error (MMSE) estimator. This method models the joint distribution of the vocal effort transfer vector and non-neutrally-phonated embedding spaces and operates in a principal component analysis domain to cope with non-neutrally-phonated speech data scarcity. Experiments are carried out using a cutting-edge speaker verification system integrating a powerful self-supervised pre-trained model for speech representation. In comparison with a state-of-the-art embedding compensation method, the proposed MMSE estimator yields superior and competitive equal error rate results when tackling shouted and whispered speech, respectively.

Index Terms—  Speaker verification, vocal effort, embedding compensation, shouted speech, whispered speech

1 Introduction

State-of-the-art speaker verification technology achieves impressive performance when dealing with neutrally-phonated (i.e., normal) speech [1, 2, 3]. However, because speaker verification systems are, for obvious reasons, mostly trained on normal speech data, their performance tends to drop dramatically in the presence of non-neutrally-phonated (e.g., shouted and whispered) speech [4, 5]. To mitigate this issue, previous work [4, 5] explored a series of minimum mean square error (MMSE) techniques estimating normal speaker embeddings from non-neutrally-phonated ones. Among these techniques, multi-environment model-based linear normalization (MEMLIN) [6], which models both the normal and non-neutrally-phonated embedding spaces by Gaussian mixtures, provided the best performance in terms of equal error rate (EER) when dealing with both shouted and whispered speech [5].

Under a Gaussian mixture modeling assumption, it is well known that the MMSE estimator can be expressed as a weighted sum of a set of partial estimates [7]. A shortcoming of MEMLIN is that this set of partial estimates is pre-computed (during an offline training stage) and fixed [6]. Therefore, these partial estimates do not account for the specificities of the non-neutrally-phonated embeddings observed at test time. In this paper, we propose an alternative MMSE estimator that overcomes this limitation of MEMLIN by modeling the joint distribution of the vocal effort transfer vector (which, as explained in Section 2, relates equivalent normal and non-neutrally-phonated speaker embeddings according to an additive model) and non-neutrally-phonated embedding spaces. Furthermore, to circumvent non-neutrally-phonated speech data scarcity, which prevents us from leveraging deep learning for embedding compensation, we also propose to carry out the estimation in a principal component analysis (PCA) domain.

We conduct experiments employing a state-of-the-art speaker verification system consisting of the concatenation of a powerful self-supervised pre-trained model for speech representation, WavLM [3], and an ECAPA-TDNN [1] back-end for speaker embedding extraction. In comparison with MEMLIN, the proposed approach shows superior and competitive EER performance when dealing with shouted and whispered speech, respectively.

The remainder of this manuscript is organized as follows. Our normal speaker embedding estimation methodology is developed in Section 2. Section 3 provides an overview of the whole vocal effort-robust speaker verification system. Section 4 is devoted to discussing the experimental results. Finally, Section 5 wraps up this work.

2 Normal Speaker Embedding Estimation

2.1 Problem Statement

Given a particular speaker, let $\tilde{\mathbf{x}}\in\mathbb{R}^{D}$ be a $D$-dimensional speaker embedding extracted from an utterance with normal vocal effort. Furthermore, let $\tilde{\mathbf{y}}\in\mathbb{R}^{D}$ be a non-neutrally-phonated counterpart of $\tilde{\mathbf{x}}$. Then, we assume the following additive model:

$$\tilde{\mathbf{y}}=\tilde{\mathbf{x}}+\tilde{\mathbf{v}}, \qquad (1)$$

where $\tilde{\mathbf{v}}\in\mathbb{R}^{D}$ represents a vocal effort transfer vector between the normal and non-neutrally-phonated modes.

For vocal effort-robust speaker verification purposes, previous work [4, 5] proposed to estimate $\tilde{\mathbf{x}}$ from an estimate of the vocal effort transfer vector, $\hat{\tilde{\mathbf{v}}}$, by following the additive model of Eq. (1):

$$\hat{\tilde{\mathbf{x}}}=\tilde{\mathbf{y}}-\hat{\tilde{\mathbf{v}}}. \qquad (2)$$

Several MMSE compensation techniques were studied in [4, 5] to realize Eq. (2), and, among them, MEMLIN [6] proved to be the best performing one. Assuming that the non-neutrally-phonated embedding domain is modeled by a $K$-component Gaussian mixture model (GMM), MEMLIN, which also models the normal embedding space by another $K$-component GMM, approximates $\tilde{\mathbf{x}}$ from a weighted combination of $K$ partial estimates $\left\{\hat{\tilde{\mathbf{v}}}^{\{k\}};\;k=1,...,K\right\}$:

$$\hat{\tilde{\mathbf{x}}}=\tilde{\mathbf{y}}-\underbrace{\sum_{k=1}^{K}P(k|\tilde{\mathbf{y}})\,\hat{\tilde{\mathbf{v}}}^{\{k\}}}_{\hat{\tilde{\mathbf{v}}}}, \qquad (3)$$

where $\left\{P(k|\tilde{\mathbf{y}});\;k=1,...,K\right\}$ are the combination weights.

As introduced in Section 1, a limitation of MEMLIN is that, unlike the combination weights, the set of partial estimates $\left\{\hat{\tilde{\mathbf{v}}}^{\{k\}};\;k=1,...,K\right\}$ is independent of the observed non-neutrally-phonated embedding $\tilde{\mathbf{y}}$ (in MEMLIN, the partial estimates are pre-computed during an offline training stage [6]), which constrains the potential of the Bayesian estimation framework. In the next subsection, we propose a different MMSE compensation approach in which the partial estimates also exploit $\tilde{\mathbf{y}}$, which yields significant speaker verification performance improvements in Section 4.
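To make Eq. (3) concrete, the following minimal Python sketch illustrates this weighted combination; it assumes a scikit-learn GMM fitted on non-neutrally-phonated embeddings and a hypothetical array of pre-computed per-component partial estimates, and is an illustration of Eq. (3) rather than the reference MEMLIN implementation [6].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def memlin_compensate(y_tilde, gmm_y, v_hat_k):
    """Eq. (3): subtract a posterior-weighted sum of fixed partial estimates.

    y_tilde : (D,) observed non-neutrally-phonated embedding.
    gmm_y   : GaussianMixture fitted on non-neutrally-phonated embeddings.
    v_hat_k : (K, D) pre-computed partial estimates (fixed after training).
    """
    # P(k | y~): responsibilities of the K mixture components.
    weights = gmm_y.predict_proba(y_tilde[None, :])[0]   # (K,)
    v_hat = weights @ v_hat_k                            # weighted combination, (D,)
    return y_tilde - v_hat                               # compensated embedding
```

Note that, regardless of the observed $\tilde{\mathbf{y}}$, only the weights change in this scheme; the partial estimates themselves stay fixed.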

2.2 Estimation Methodology

Inspired by classical noise-robust speech recognition methods such as front-end joint uncertainty decoding (FE-Joint) [8] and stereo-based stochastic mapping (SSM) [9], we explore jointly modeling the vocal effort transfer vector and non-neutrally-phonated embedding domains by means of a $K$-component GMM $p(\tilde{\mathbf{z}}=(\tilde{\mathbf{v}},\tilde{\mathbf{y}}))$. However, estimating $p(\tilde{\mathbf{z}}=(\tilde{\mathbf{v}},\tilde{\mathbf{y}}))$ requires the computation of $2D\times 2D$ covariance matrices that are ill-conditioned under our non-neutrally-phonated speech data scarcity scenario (see Subsection 3.1). To deal with this issue, we propose to use PCA as detailed below.

Let $\mathbf{W}_{L}$ be a $D\times L$ PCA transform matrix calculated from normal and non-neutrally-phonated embeddings. This matrix comprises, column-wise, $L$ principal eigenvectors, where $L\ll D$. We can express $\tilde{\mathbf{v}}$ and $\tilde{\mathbf{y}}$ in the PCA domain as

$$\mathbf{v}=\mathbf{W}_{L}^{\top}\tilde{\mathbf{v}},\qquad \mathbf{y}=\mathbf{W}_{L}^{\top}\tilde{\mathbf{y}}. \qquad (4)$$
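For illustration, under the assumption that training embeddings are stacked row-wise in NumPy arrays (the names below are ours), the PCA transform matrix $\mathbf{W}_{L}$ of Eq. (4) could be obtained, e.g., with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_transform(X_tilde, Y_tilde, L=16):
    """Estimate the D x L transform W_L of Eq. (4) from paired training embeddings.

    X_tilde, Y_tilde : (N, D) paired normal / non-neutrally-phonated embeddings.
    """
    pca = PCA(n_components=L)
    pca.fit(np.vstack([X_tilde, Y_tilde]))   # pooled normal + non-neutral embeddings
    return pca.components_.T                 # (D, L): the L principal eigenvectors

# Projection as in Eq. (4): v = W_L.T @ v_tilde, y = W_L.T @ y_tilde
# (the centering that scikit-learn applies internally is omitted here, as in Eq. (4)).
```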

Then, the joint variable $\mathbf{z}=(\mathbf{v},\mathbf{y})$, $\mathbf{z}\in\mathbb{R}^{2L}$, is modeled by a $K$-component GMM:

$$p\left(\mathbf{z}\right)=\sum_{k=1}^{K}P(k)\,\mathcal{N}\!\left(\mathbf{z}\,\middle|\,\boldsymbol{\mu}_{z}^{\{k\}},\boldsymbol{\Sigma}_{z}^{\{k\}}\right). \qquad (5)$$

In Eq. (5), $\left\{P(k);\;k=1,...,K\right\}$ is the set of prior probabilities, whereas the mean vector $\boldsymbol{\mu}_{z}^{\{k\}}$ and covariance matrix $\boldsymbol{\Sigma}_{z}^{\{k\}}$ of the $k$-th Gaussian density can be partitioned as

$$\boldsymbol{\mu}_{z}^{\{k\}}=\begin{pmatrix}\boldsymbol{\mu}_{v}^{\{k\}}\\ \boldsymbol{\mu}_{y}^{\{k\}}\end{pmatrix},\qquad \boldsymbol{\Sigma}_{z}^{\{k\}}=\begin{pmatrix}\boldsymbol{\Sigma}_{vv}^{\{k\}}&\boldsymbol{\Sigma}_{vy}^{\{k\}}\\ \boldsymbol{\Sigma}_{yv}^{\{k\}}&\boldsymbol{\Sigma}_{yy}^{\{k\}}\end{pmatrix}, \qquad (6)$$

where $\boldsymbol{\Sigma}_{vv}^{\{k\}}$, $\boldsymbol{\Sigma}_{vy}^{\{k\}}=\left(\boldsymbol{\Sigma}_{yv}^{\{k\}}\right)^{\top}$ and $\boldsymbol{\Sigma}_{yy}^{\{k\}}$ are all $L\times L$ diagonal matrices.

Given Eq. (5), the MMSE estimate of $\mathbf{v}$, $\hat{\mathbf{v}}$, is calculated as follows:

$$\begin{aligned}\hat{\mathbf{v}}&=\mathbb{E}(\mathbf{v}|\mathbf{y})=\int_{\mathbf{v}}\mathbf{v}\,p(\mathbf{v}|\mathbf{y})\,d\mathbf{v}\\ &=\sum_{k=1}^{K}\int_{\mathbf{v}}\mathbf{v}\,p\left(\mathbf{v},k|\mathbf{y}\right)d\mathbf{v}\\ &=\sum_{k=1}^{K}P(k|\mathbf{y})\int_{\mathbf{v}}\mathbf{v}\,p(\mathbf{v}|\mathbf{y},k)\,d\mathbf{v}\\ &=\sum_{k=1}^{K}P(k|\mathbf{y})\underbrace{\mathbb{E}\left(\mathbf{v}|\mathbf{y},k\right)}_{\hat{\mathbf{v}}^{\{k\}}},\end{aligned} \qquad (7)$$

where $\mathbb{E}(\cdot)$ denotes the expectation operator. On the one hand, the combination weights in Eq. (7) are obtained, by means of Bayes' rule, according to

$$P(k|\mathbf{y})=\frac{p(\mathbf{y}|k)P(k)}{\sum_{k'=1}^{K}p(\mathbf{y}|k')P(k')},\qquad k=1,...,K, \qquad (8)$$

where $p(\mathbf{y}|k)=\mathcal{N}\!\left(\mathbf{y}\,\middle|\,\boldsymbol{\mu}_{y}^{\{k\}},\boldsymbol{\Sigma}_{yy}^{\{k\}}\right)$. On the other hand, given that the joint density $p(\mathbf{z}=(\mathbf{v},\mathbf{y})|k)$ is Gaussian, the conditional density $p(\mathbf{v}|\mathbf{y},k)$ is also Gaussian and, therefore, $\mathbb{E}\left(\mathbf{v}|\mathbf{y},k\right)$, i.e., the partial estimates in Eq. (7), can be expressed, $\forall k\in\{1,...,K\}$, as [10]

$$\mathbb{E}\left(\mathbf{v}|\mathbf{y},k\right)=\boldsymbol{\mu}_{v}^{\{k\}}+\boldsymbol{\Sigma}_{vy}^{\{k\}}\left(\boldsymbol{\Sigma}_{yy}^{\{k\}}\right)^{-1}\left(\mathbf{y}-\boldsymbol{\mu}_{y}^{\{k\}}\right). \qquad (9)$$
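To make Eqs. (7)-(9) concrete, the following Python sketch (ours, not the released implementation) computes $\hat{\mathbf{v}}$, assuming the joint GMM parameters of Eq. (6) are available as NumPy arrays holding the block diagonals:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mmse_v_estimate(y, priors, mu_v, mu_y, S_vy, S_yy):
    """MMSE estimate of v given y (Eqs. (7)-(9)) with diagonal covariance blocks.

    y          : (L,) non-neutrally-phonated embedding in the PCA domain.
    priors     : (K,) component priors P(k).
    mu_v, mu_y : (K, L) mean sub-vectors of Eq. (6).
    S_vy, S_yy : (K, L) diagonals of the covariance blocks of Eq. (6).
    """
    K = priors.shape[0]
    # Eq. (8): P(k|y) via Bayes' rule, with p(y|k) = N(y; mu_y^{k}, Sigma_yy^{k}).
    lik = np.array([multivariate_normal.pdf(y, mean=mu_y[k], cov=np.diag(S_yy[k]))
                    for k in range(K)])
    post = priors * lik
    post /= post.sum()
    # Eq. (9): partial estimates E(v|y,k); diagonal blocks reduce to element-wise ops.
    partial = mu_v + (S_vy / S_yy) * (y - mu_y)   # (K, L)
    # Eq. (7): posterior-weighted combination of the partial estimates.
    return post @ partial                         # (L,)
```

Unlike MEMLIN, both the weights (Eq. (8)) and the partial estimates (Eq. (9)) depend on the observed embedding.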

Finally, an estimate of the normal embedding $\tilde{\mathbf{x}}$ is achieved by means of Eq. (2) along with the application of the inverse PCA transform to the result of Eq. (7), namely,

$$\hat{\tilde{\mathbf{x}}}=\tilde{\mathbf{y}}-\underbrace{\mathbf{W}_{L}\hat{\mathbf{v}}}_{\hat{\tilde{\mathbf{v}}}}. \qquad (10)$$

Note that, in order to apply this method in Section 4, both the PCA transform matrix $\mathbf{W}_{L}$ and the GMM $p(\mathbf{z})$ are calculated from a training set comprising paired normal and non-neutrally-phonated embeddings (see Subsection 3.1).
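A minimal training-and-compensation sketch, assuming paired training embeddings stored row-wise in NumPy arrays X_tilde and Y_tilde (hypothetical names) and reusing mmse_v_estimate from above, might look as follows; the diagonal-block covariance structure of Eq. (6) is only approximated here by keeping the block diagonals of full covariance matrices fitted with scikit-learn.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_mmse_v(X_tilde, Y_tilde, W_L, K=8):
    """Fit the joint GMM p(z = (v, y)) of Eq. (5) in the PCA domain.

    X_tilde, Y_tilde : (N, D) paired normal / non-neutrally-phonated embeddings.
    W_L              : (D, L) PCA transform matrix of Eq. (4).
    """
    L = W_L.shape[1]
    V = (Y_tilde - X_tilde) @ W_L        # transfer vectors (Eq. (1)) in the PCA domain
    Y = Y_tilde @ W_L
    Z = np.hstack([V, Y])                # joint variable z = (v, y)
    gmm = GaussianMixture(n_components=K, covariance_type='full').fit(Z)
    mu_v, mu_y = gmm.means_[:, :L], gmm.means_[:, L:]
    # Keep only the diagonals of the covariance blocks of Eq. (6), approximating
    # the diagonal-block structure assumed in the paper.
    S_vy = np.array([np.diag(C[:L, L:]) for C in gmm.covariances_])
    S_yy = np.array([np.diag(C[L:, L:]) for C in gmm.covariances_])
    return gmm.weights_, mu_v, mu_y, S_vy, S_yy

def compensate(y_tilde, W_L, params):
    """Eq. (10): x_hat = y_tilde - W_L v_hat, with v_hat from Eqs. (7)-(9)."""
    v_hat = mmse_v_estimate(W_L.T @ y_tilde, *params)
    return y_tilde - W_L @ v_hat
```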

For the sake of reproducibility, a Python implementation of this speaker embedding compensation methodology has been made publicly available at https://ilopezes.files.wordpress.com/2023/06/mmsev.zip.

3 System Overview

[Figure 1 (block diagram) not reproduced; signals shown: $\tilde{\mathbf{y}}$, $\mathbf{y}$, $\hat{\mathbf{v}}$, $\hat{\tilde{\mathbf{v}}}$, $\hat{\tilde{\mathbf{x}}}$, $\tilde{\mathbf{x}}_{\text{ref}}$ and $s_c$.]
Fig. 1: Block diagram of the proposed vocal effort-robust speaker verification system. See the text for further details.

Figure 1 depicts a block diagram of the proposed vocal effort-robust speaker verification system. First, the powerful self-supervised pre-trained model WavLM [3] is used to compute a high-level representation of the input speech signal. Based on a Transformer structure, WavLM extends HuBERT [11] to masked speech prediction and de-noising, which allows the pre-trained model to perform well in a variety of speech processing tasks including speaker verification. Second, an ECAPA-TDNN [1] back-end extracts a speaker embedding from the representation output by WavLM. Then, the speaker embedding compensation methodology of Section 2 is applied only when the embedding comes from non-neutrally-phonated speech. To detect this case, a simple, yet virtually flawless, logistic regression-based detector [4, 5] can be used; that said, the results reported in Section 4 are obtained with oracle non-neutrally-phonated speech detection for the sake of simplicity. Finally, the resulting embedding is compared with a reference embedding $\tilde{\mathbf{x}}_{\text{ref}}$ by cosine similarity to produce a score $s_c$.
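As a brief illustration of this final scoring step (variable names ours), the cosine similarity score $s_c$ between the possibly compensated test embedding and the reference embedding can be computed as:

```python
import numpy as np

def cosine_score(x_hat, x_ref):
    """s_c = <x_hat, x_ref> / (||x_hat|| * ||x_ref||)."""
    return float(x_hat @ x_ref / (np.linalg.norm(x_hat) * np.linalg.norm(x_ref)))
```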

3.1 Shouted and Whispered Speech Corpora

For experimental purposes, we consider the shouted and whispered vocal effort modes in addition to normal. To this end, we employ two different (i.e., disjoint) corpora: the speech corpus described in [12], which comprises paired shouted-normal speech utterances in Finnish from 22 speakers, and CHAINS (CHAracterizing INdividual Speakers) [13], which contains paired whispered-normal speech utterances in English from 36 speakers. Due to speech data scarcity, all the embedding compensation experiments in Section 4 are performed, as in [5], by following a leave-one-speaker-out cross-validation strategy, which serves to split the corpora into training and test sets; a sketch of this protocol is given below.
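The following Python sketch of the leave-one-speaker-out protocol reuses the helper functions sketched in Section 2 and assumes a hypothetical per-utterance speaker-label array spk (names ours):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# X_tilde, Y_tilde: (N, D) paired embeddings; spk: (N,) speaker label per pair.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X_tilde, groups=spk):
    W_L = fit_pca_transform(X_tilde[train_idx], Y_tilde[train_idx], L=16)
    params = train_mmse_v(X_tilde[train_idx], Y_tilde[train_idx], W_L, K=8)
    # Compensate the held-out speaker's non-neutrally-phonated embeddings.
    X_hat = np.array([compensate(y, W_L, params) for y in Y_tilde[test_idx]])
```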

We consider the following 4 test conditions (trial lists) under the shouted-normal scenario: A_s-A_s (all shouted and normal utterances vs. all shouted and normal utterances; 557,040 trials), N_s-N_s (normal utterances vs. normal utterances; 139,128 trials), S-S (shouted utterances vs. shouted utterances; 139,128 trials) and N_s-S (normal utterances vs. shouted utterances; 278,784 trials). Furthermore, we similarly examine 4 equivalent test conditions under the whispered-normal scenario, namely, A_w-A_w (2,821,498 trials), N_w-N_w (705,078 trials), W-W (704,950 trials) and N_w-W (1,411,344 trials).

For further details about these corpora, the reader is referred to [12, 13] and [5].

Fig. 2: Speaker verification results in terms of EER, in percentages, as a function of the dimensionality, after PCA application, of the embeddings processed by MEMLIN, MMSE_x and MMSE_v. Bar plots are shown for shouted and normal speech (top row), as well as for whispered and normal speech (bottom row).
Table 1: Speaker verification results in terms of EER, in percentages, when considering both shouted and normal speech. MEMLIN+PCA, MMSE_x and MMSE_v process, after PCA application, $L=16$-dimensional embeddings.

Condition   E-T+MFCC   E-T+WavLM   MEMLIN   MEMLIN+PCA   MMSE_x   MMSE_v
A_s-A_s     19.96      17.11       15.62    31.50        28.72    15.22
N_s-N_s     9.73       7.25        7.25     7.25         7.25     7.25
S-S         11.58      9.94        10.44    27.46        25.53    5.91
N_s-S       25.28      21.76       20.74    41.00        35.56    17.74
Table 2: Speaker verification results in terms of EER, in percentages, when considering both whispered and normal speech. MEMLIN+PCA, MMSE_x and MMSE_v process, after PCA application, $L=16$-dimensional embeddings.

Condition   E-T+MFCC   E-T+WavLM   MEMLIN   MEMLIN+PCA   MMSE_x   MMSE_v
A_w-A_w     16.54      11.24       8.25     31.87        23.95    8.27
N_w-N_w     1.21       0.62        0.62     0.62         0.62     0.62
W-W         4.38       5.26        4.00     19.31        19.77    2.87
N_w-W       12.81      9.81        11.47    44.38        30.59    8.86

3.2 System Implementation Details

The ECAPA-TDNN back-end used was trained, employing the additive angular margin (AAM) loss [14], on an augmented version of the VoxCeleb2 [15] dataset to extract $D=256$-dimensional speaker embeddings. Considering an AAM loss margin of 0.2, first, WavLM, which was pre-trained on 94k hours of unlabeled speech data, was kept frozen while the ECAPA-TDNN parameters were trained for a total of 20 epochs. Second, WavLM and the ECAPA-TDNN back-end were jointly fine-tuned for 5 epochs. Finally, following the large margin fine-tuning strategy reported in [16], WavLM and the ECAPA-TDNN back-end were jointly trained for 2 more epochs with an AAM loss margin of 0.4. For the sake of reproducibility, the model corresponding to this speaker verification system is publicly available at https://github.com/microsoft/unilm/tree/master/wavlm. The reader is referred to [3] for further information on this speaker verification system.
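For context, the sketch below shows a minimal PyTorch implementation of an AAM-softmax criterion in the spirit of [14]; it is our own illustration, and the scale value of 30 as well as the implementation details are assumptions, not taken from [3] or [16].

```python
import torch
import torch.nn.functional as F

def aam_softmax_loss(emb, weight, labels, margin=0.2, scale=30.0):
    """Additive angular margin (AAM) softmax loss, illustrative of [14].

    emb    : (B, D) speaker embeddings from the ECAPA-TDNN back-end.
    weight : (C, D) learnable per-speaker weight matrix.
    labels : (B,) target speaker indices.
    """
    cos = F.normalize(emb) @ F.normalize(weight).t()           # cos(theta), (B, C)
    theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    # Add the angular margin m only to the target-speaker angle, then scale.
    logits = scale * torch.where(target, torch.cos(theta + margin), cos)
    return F.cross_entropy(logits, labels)
```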

4 Experimental Results

In this section, EER is chosen as the speaker verification performance metric. In addition, as in previous work [4, 5], all the evaluated embedding compensation techniques make use of $K=8$-component GMMs.
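For reference, EER can be computed from trial scores (e.g., the cosine scores $s_c$ above) and their target/non-target labels as in the following sketch (names ours); this is a common ROC-based approximation rather than the exact scoring tool used in our experiments.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER: operating point where false acceptance and false rejection rates meet.

    scores : (N,) trial scores (higher = more likely same speaker).
    labels : (N,) 1 for target trials, 0 for non-target trials.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # closest point to FAR == FRR
    return 0.5 * (fpr[idx] + fnr[idx])
```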

4.1 WavLM Performance

Tables 1 and 2 show speaker verification results in terms of EER under the shouted-normal and whispered-normal scenarios, respectively. The left part of these tables compares, when no embedding compensation is considered, the use of WavLM speech representations (as in Section 3), E-T+WavLM, with the use of traditional speech features, E-T+MFCC (note that E-T stands for ECAPA-TDNN). Specifically, the E-T+MFCC speaker verification system, which is publicly available at https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb, employs 80-dimensional Mel-frequency cepstral coefficients [17]. In line with [3], we can see from these tables that E-T+WavLM generally outperforms E-T+MFCC. That being said, we can also observe that there is still large room for improvement in the presence of vocal effort mismatch (all conditions except N_s-N_s and N_w-N_w), which is addressed by embedding compensation in the next subsections. Bear in mind that all the embedding compensation experiments in this section are carried out by employing E-T+WavLM as the baseline system.

4.2 Effect of PCA Dimension

Figure 2 plots the EER performance of the estimation methodology proposed in Section 2, MMSE_v, as a function of the PCA dimension $L$. For comparison, these bar plots also show results from MEMLIN (applied in the PCA domain) as well as from an equivalent MMSE estimator, MMSE_x, that directly estimates the normal embedding $\tilde{\mathbf{x}}$ as $\mathbb{E}[\mathbf{x}|\mathbf{y}]$. From this figure, we can see that MEMLIN's performance tends to drop as $L$ decreases as a result of the information loss caused by PCA compression, which can be particularly harmful when the estimation relies on a small set of pre-computed, fixed partial estimates.

On the other hand, MMSE_v involves the computation of $2L\times 2L$ covariance matrices, $\boldsymbol{\Sigma}_{z}^{\{k\}}$, under a data scarcity scenario. Given our small sample size, reducing $L$ yields better-conditioned covariance matrices for use in Eqs. (8) and (9). This, together with the fact that MMSE_v exploits the observed non-neutrally-phonated embedding $\tilde{\mathbf{y}}$ to calculate the partial estimates, can explain why the EER of MMSE_v decreases down to $L=16$ (see Figure 2). Decreasing $L$ beyond this point harms speaker verification performance due to the information loss entailed by PCA compression.

Regarding MMSE_x, an internal analysis revealed that estimating the normal embedding $\tilde{\mathbf{x}}$ as $\mathbb{E}[\mathbf{x}|\mathbf{y}]$ yields target and non-target score probability masses that are poorly separated, as a result of compensated embeddings $\hat{\tilde{\mathbf{x}}}$ in which the speaker-specific information is significantly distorted. Interestingly, we also observed that the vocal effort transfer vector $\tilde{\mathbf{v}}$ is only weakly speaker-dependent. Therefore, estimating $\tilde{\mathbf{x}}$ as $\tilde{\mathbf{y}}-\hat{\tilde{\mathbf{v}}}$ according to MMSE_v better preserves the speaker-specific information contained in $\tilde{\mathbf{y}}$, which, in turn, leads to better-separated target and non-target score probability masses.

4.3 Embedding Compensation Performance Summary

The right part of Tables 1 and 2 compares standard MEMLIN (i.e., without PCA) with MMSE_v, MMSE_x and MEMLIN applied in the PCA domain (MEMLIN+PCA). Note that, in these tables, the latter three techniques process, after PCA application, $L=16$-dimensional embeddings. Under the shouted-normal scenario (Table 1), MMSE_v outperforms MEMLIN in the presence of vocal effort mismatch (i.e., in A_s-A_s, S-S and N_s-S). Furthermore, while MEMLIN is on par with MMSE_v in A_w-A_w under the whispered-normal scenario (Table 2), MMSE_v achieves in N_w-W a 22.7% relative EER improvement with respect to MEMLIN, which actually worsens the baseline system E-T+WavLM in this condition (as it also does in the S-S condition).

5 Concluding Remarks

In this work, we have shown that embedding compensation can significantly mitigate the speaker verification performance drop caused by vocal effort mismatch when a state-of-the-art speaker verification system integrating a cutting-edge self-supervised pre-trained model for speech representation is used. With the aim of improving a reference embedding compensation method, namely MEMLIN, we have proposed an MMSE estimator of the vocal effort transfer vector that, unlike MEMLIN, exploits the non-neutrally-phonated embeddings observed at test time for partial estimate calculation and operates in a PCA domain to cope with non-neutrally-phonated speech data scarcity. Compared with MEMLIN, the proposed MMSE estimator has shown superior and competitive EER performance when processing shouted and whispered speech, respectively.

References

  • [1] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proceedings of INTERSPEECH 2020 – 21st Annual Conference of the International Speech Communication Association, October 25-29, Shanghai, China, 2020, pp. 3830–3834.
  • [2] Zhengyang Chen, Sanyuan Chen, Yu Wu, Yao Qian, Chengyi Wang, Shujie Liu, Yanmin Qian, and Michael Zeng, “Large-scale self-supervised speech representation learning for automatic speaker verification,” in Proceedings of ICASSP 2022 – 47th International Conference on Acoustics, Speech, and Signal Processing, May 23-27, Singapore, 2022, pp. 6147–6151.
  • [3] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, pp. 1505–1518, 2022.
  • [4] Santi Prieto, Alfonso Ortega, Iván López-Espejo, and Eduardo Lleida, “Shouted speech compensation for speaker verification robust to vocal effort conditions,” in Proceedings of INTERSPEECH 2020 – 21st Annual Conference of the International Speech Communication Association, October 25-29, Shanghai, China, 2020, pp. 1511–1515.
  • [5] Santi Prieto, Alfonso Ortega, Iván López-Espejo, and Eduardo Lleida, “Shouted and whispered speech compensation for speaker verification systems,” Digital Signal Processing, vol. 127, 2022.
  • [6] Luis Buera, Eduardo Lleida, Antonio Miguel, Alfonso Ortega, and Oscar Saz, “Cepstral vector normalization based on stereo data for robust speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1098–1113, 2007.
  • [7] José A. González López, Reconocimiento robusto de voz con datos perdidos o inciertos, Ph.D. thesis, University of Granada, 2013.
  • [8] H. Liao and M.J.F. Gales, “Issues with uncertainty decoding for noise robust automatic speech recognition,” Speech Communication, vol. 50, pp. 265–277, 2008.
  • [9] Mohamed Afify, Xiaodong Cui, and Yuqing Gao, “Stereo-based stochastic mapping for robust speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, pp. 1325–1334, 2009.
  • [10] Athanasios Papoulis and S. Unnikrishna Pillai, Probability, Random Variables and Stochastic Processes (4th Edition), McGraw-Hill Europe, 2002.
  • [11] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [12] Cemal Hanilçi, Tomi Kinnunen, Rahim Saeidi, Jouni Pohjalainen, Paavo Alku, and Figen Ertaş, “Speaker identification from shouted speech: Analysis and compensation,” in Proceedings of ICASSP 2013 – 38th International Conference on Acoustics, Speech, and Signal Processing, May 26-31, Vancouver, Canada, 2013, pp. 8027–8031.
  • [13] Fred Cummins, Marco Grimaldi, Thomas Leonard, and Juraj Simko, “The CHAINS corpus: CHAracterizing INdividual Speakers,” in Proceedings of SPECOM, 2006, pp. 431–435.
  • [14] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou, “ArcFace: Additive angular margin loss for deep face recognition,” in Proceedings of CVPR 2019 – IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 15-20, Long Beach, USA, 2019, pp. 4685–4694.
  • [15] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proceedings of INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, September 2-6, Hyderabad, India, 2018, pp. 1086–1090.
  • [16] Jenthe Thienpondt, Brecht Desplanques, and Kris Demuynck, “The Idlab Voxsrc-20 submission: Large margin fine-tuning and quality-aware score calibration in DNN based speaker verification,” in Proceedings of ICASSP 2021 – 46th International Conference on Acoustics, Speech, and Signal Processing, June 6-11, Toronto, Canada, 2021, pp. 5814–5818.
  • [17] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357–366, 1980.