
Constrained speaker linking

Abstract

In this paper we study speaker linking (a.k.a. partitioning) given constraints on the distribution of speaker identities over speech recordings. Specifically, we show that the intractable partitioning problem becomes tractable when the constraints pre-partition the data into smaller cliques with non-overlapping speakers. The surprisingly common case where the speakers in a telephone conversation are known, but the assignment of channels to identities is unspecified, is treated in a Bayesian way. We show that for the Dutch CGN database, where this channel assignment task is at hand, a lightweight speaker recognition system can solve the channel assignment problem quite effectively, with 93 % of the cliques solved. We further show that the posterior distribution over channel assignment configurations is well calibrated.

Index terms: Speaker recognition, linking, partitioning, clustering, Bayesian.

1 Introduction

Speaker linking [1], also known as speaker partitioning [2], is the general problem of finding which utterances in a collection of recordings are spoken by the same speaker. It resembles speaker diarization [3], in that usually no prior training material of any of the speakers is given, but in speaker diarization there is the additional problem of segmentation, with the possibility of overlapping speech [4]. In [2] it is shown that the task covers many different problems in the area of speaker recognition, ranging from traditional ‘NIST SRE-style’ speaker detection to speaker counting and unsupervised adaptation. The authors approach the problem in a purely Bayesian way, meaning that the task is defined as computing a posterior distribution over all possible speaker partitions, given the data and a prior distribution over the partitions. Because of the combinatorial explosion of the number of partitions, the Bayesian approach can only be taken for small-sized problems. A different approach to essentially the same problem was taken in [1], where a solution was sought in terms of agglomerative clustering, i.e., making sequences of speaker linking decisions, concentrating on large scale problems. Speaker linking has found application in large scale speaker diarization [5, 6, 7], where the task of speaker diarization within a single recording is extended to finding the same speaker across multiple recordings, and to very long recordings for which the speaker diarization problem becomes computationally challenging [5].

In this paper, we investigate how prior information can be used in the speaker linking problem. We use a probabilistic approach following [2], and analyse prior information in terms of uncertainty. We then apply this to a specific, but remarkably common, situation in telephone speech data collections: the case where the speaker identities in a telephone conversation are known, but the assignment of identities to the two channels in the recorded conversation is unknown. We refer to this problem as the channel assignment task. In The Netherlands, we have seen this situation more than once. In the telephone interception recordings made in police investigations, both parties of a conversation are recorded in separate channels. The ‘identity’ of the speaker in such a case is limited to technical metadata such as the telephone number or the IMSI and IMEI numbers of the SIM and handset, respectively. This metadata is stored in the interception database, but for a reason unknown to us, it is impossible to tell which identity is recorded in which channel of the 2-channel audio file. From a forensic point of view, this does not pose a problem. A specific person is under investigation, which means that calls to/from this person’s telephone are intercepted. When the content of the call contains incriminating evidence, it is that fragment that is going to be important. If the speaker identity of that fragment is questioned, forensic speaker comparison is going to play a role in the case. But even then, the channel assignment of the phone number is not important. This is different when we want to employ data mining approaches to all recordings in a police investigation. Then the link between identity and channel is very relevant indeed.

A second example is the data collection ‘Dutch Spoken Corpus’ (Corpus Gesproken Nederlands, CGN [8]). This is a large general speech database of contemporary Dutch spoken in the Netherlands and Belgium, with a wide variety of speech sources and speaking styles, annotated at various levels of detail. As such, it is widely used by researchers in linguistics, language and speech technology in these countries. In 2008, the data was used as training material in the project N-Best, an evaluation of Dutch speech recognition systems [9]. At the preparation stage it turned out that, despite the multitude of annotations available with CGN, the mapping of orthographic annotation (and speaker identities) to channels within a telephone conversation was unknown. A possible reason for this is that the telephone conversations recorded as two channels contribute only a small portion of the entire CGN, and that despite the very broad application scenario included in the design of CGN, requirements for automatic speaker recognition may not have been fully worked out. At the time, additional manual speaker attribution was carried out and distributed to the participants of N-Best. Participants reported different strategies for dealing with the originally missing speaker-to-channel annotation [10, 11].

Other situations where speaker identities are known but the exact mapping is unknown arise in large scale diarization problems. E.g., in [6], a sequence of meetings is processed with diarization, and the speaker linking between meetings is carried out afterwards. Sometimes the speaker information is partially known, as in the case of speaker diarization of broadcast shows, where TV-guide metadata can reveal the names of some of the speakers [12].

2 Reducing uncertainty

We are going to describe the speaker linking problem in terms of uncertainty, where both prior information and speaker recognition can contribute to lowering the uncertainty in the speaker identities in the database. We will do this in terms of the total entropy in the database, parameterized by size and other prior information. The number of recordings will be denoted by $2M$, anticipating that we will shortly investigate a collection of $M$ conversations, each contributing two separate recordings. In the following, we will consecutively add constraints, or prior information, to the analysis.

2.1 Speaker-homogeneous recordings

The overall restriction in this paper is that each speech recording contains speech from only a single speaker. This excludes the task of speaker diarization. The speaker entropy is $H_{\rm I}=\log B_{2M}$, where $B_{n}$ is the $n$th ‘Bell number’ [2]. Here we have applied no further priors on the number of speakers and their distribution, not even the constraint that, for a telephone call, a speaker can't talk to herself.
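As a sanity check on these entropies, the Bell numbers can be computed exactly with arbitrary-precision integers. The following is a minimal sketch in Python (not the paper's Julia implementation; the function names are ours), using the Bell triangle recurrence:

```python
from math import log2

def bell(n):
    """Bell number B_n, computed exactly with the Bell triangle."""
    if n == 0:
        return 1
    row = [1]  # triangle row ending in B_1
    for _ in range(n - 1):
        new = [row[-1]]  # each row starts with the last entry of the previous one
        for x in row:
            new.append(new[-1] + x)
        row = new
    return row[-1]

def entropy_I(M):
    """H_I = log2 B_{2M}: prior speaker entropy, in bits, of 2M recordings."""
    return log2(bell(2 * M))
```

For the CGN test-data parameters of Table 1 ($M=348$), this evaluates $\log_2 B_{696}$, the 4163.2 bits quoted there.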

2.2 Number of speakers is known

When additionally the number of speakers $N$ in the database is known, the entropy is reduced to $H_{\rm II}=\log\left\{{2M\atop N}\right\}$, where $\left\{{n\atop k}\right\}$ denotes the Stirling number of the second kind. An upper bound approximation is $2M\log N-\log N!$, i.e., for each recording the identity can be any of $N$ speakers, and we must compensate for the arbitrary speaker labelling.
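The Stirling numbers, and hence $H_{\rm II}$ and its upper bound, can be evaluated exactly in the same spirit (a Python sketch under the same assumptions, using the standard inclusion-exclusion formula):

```python
from math import comb, factorial, log2

def stirling2(n, k):
    """Stirling number of the second kind, exactly:
    S(n, k) = (1/k!) * sum_j (-1)^j C(k, j) (k - j)^n."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)  # division is exact

def entropy_II(M, N):
    """H_II = log2 S(2M, N), in bits."""
    return log2(stirling2(2 * M, N))

def upper_bound_II(M, N):
    """The upper-bound approximation 2M log2 N - log2 N!, in bits."""
    return 2 * M * log2(N) - log2(factorial(N))
```

For the Table 1 parameters ($M=348$, $N=356$) the bound overestimates $H_{\rm II}$ by roughly a hundred bits.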

2.3 Telephone conversations

When we know that the conversations are all telephone conversations, we can exclude situations where one speaker occurs on both sides of a conversation. This reduces the entropy just a little further, by approximately $M\log\frac{N}{N-1}$.
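With the CGN test-set parameters of Table 1, this correction is indeed small (a minimal numeric check in Python; the variable names are ours):

```python
from math import log2

# CGN test-set parameters from Table 1 (348 conversations, 356 speakers)
M, N = 348, 356

# entropy reduction from excluding self-conversations: M * log2(N / (N - 1))
reduction = M * log2(N / (N - 1))
# about 1.4 bits, so H_III = 3292.7 - 1.4 = 3291.3 bits, as in Table 1
```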

2.4 Speakers in conversations are known

We now make a big step to the channel assignment task described in the introduction: the speaker identities of the parties participating in a conversation are known, but it is unknown which speaker is in which channel. This prior results in a reduction of the entropy to $H_{\rm IV}=M\log 2$. If the entropy is expressed in bits, we have $H_{\rm IV}=M$.

2.5 Application of speaker recognition

Speaker linking can further reduce the entropy. For instance, if a single speaker occurs in all $M$ conversations, and there are $M+1$ speakers in total, it is easy to see that, if speaker recognition works flawlessly, the entropy can be reduced to 0. However, many different partitionings of the speakers over the calls are possible, and not all will have the same potential w.r.t. entropy reduction, even with a perfect speaker recognition system. For instance, if the same two speakers occur in all $M$ conversations, there is still $\log 2$ entropy left. And in the extreme situation that there are $2M$ speakers in the database, the entropy remains at $M\log 2$, as before.

The entropies described in the previous paragraph are potentially attainable entropies in the case of perfect speaker linking. However, speaker recognition is based on statistical models of speech and speakers, and can make errors. We therefore need a different measure to compute the effect of speaker recognition, and we will use the cross entropy for that.

3 Speaker recognition

We will use the detection capability of a speaker recognition system as the information used in speaker linking. We assume that when two utterances (different conversation sides) $x$ and $y$ are available, the recognizer can provide a log-likelihood-ratio for comparing the two

$\lambda(x,y)=\log\frac{P(x,y\mid{\cal H}_{1})}{P(x,y\mid{\cal H}_{2})}$ (1)

where ${\cal H}_{1}$ and ${\cal H}_{2}$ are the hypotheses that $x$ and $y$ are spoken by the same or different speakers, respectively. We assume that the system is well calibrated, i.e., that $\lambda$ can be used effectively to compute a minimum risk Bayes decision.

We will utilize the speaker detector for analyzing various sub-partitionings of the telephone channel assignment problem. We will have to introduce some notation at this point. A conversation labelling is the way speaker pairs are distributed over telephone conversations, i.e., without explicitly encoding which speaker is in which channel. If such a conversation labelling encompasses $M$ conversations, there are $2^{M}$ possible configurations ${\cal C}$ of the speaker pairs over the channels in the conversations. These can be numbered in binary notation, using $\ell$ and $r$ instead of the traditional 0's and 1's. The channel assignment problem of Section 2.4 is to find the correct configuration given the conversation labelling and the speech data.

For a given configuration ${\cal C}$ of speakers over recordings, the likelihood is $L_{\cal C}=P(X\mid{\cal C})$, where $X$ denotes the relevant speech. Given the prior probability $\pi_{\cal C}$ for this configuration and the likelihoods, the posterior can be computed using

$P({\cal C}\mid X)=\frac{\pi_{\cal C}L_{\cal C}}{\sum_{{\cal C}^{\prime}}\pi_{{\cal C}^{\prime}}L_{{\cal C}^{\prime}}},$ (2)

where the summation is over all $2^{M}$ possible configurations ${\cal C}^{\prime}$. For the stated task, it is not unreasonable to set the prior uniformly at $\pi_{\cal C}=2^{-M}$.
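A hedged sketch of (2) in Python (our own helper, working in the log domain for numerical stability; with a uniform prior the $\pi_{\cal C}$ factors cancel in the normalization):

```python
from math import exp

def config_posterior(log_liks, log_priors=None):
    """Posterior over configurations, eq. (2), from log-likelihoods
    log L_C and optional log-priors log pi_C (uniform if omitted)."""
    if log_priors is None:
        log_priors = [0.0] * len(log_liks)  # uniform prior cancels anyway
    logp = [ll + lp for ll, lp in zip(log_liks, log_priors)]
    m = max(logp)                           # log-sum-exp trick
    w = [exp(x - m) for x in logp]
    z = sum(w)
    return [x / z for x in w]
```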

In Table 1 we have summarized the way the reduction of entropy evolves by specifying more constraints. As a numerical example, we have used the test data in the experiment explained in Section 4.

Table 1: Some example values for the entropies of the speaker partitioning problem. We have used the parameters of the test data from CGN (348 conversations involving 356 speakers) as an example of the reduction in entropy $H$, which is expressed in bits here. $F$ is the average confusion, see Section 4.1.

Section   Expression $H$                                  CGN $H$   $F$
2.1       $H_{\rm I}=\log B_{2M}$                         4163.2    3991
2.2       $H_{\rm II}=\log\left\{{2M\atop N}\right\}$     3292.7    704
2.3       $H_{\rm III}=H_{\rm II}-M\log\frac{N}{N-1}$     3291.3    702
2.4       $H_{\rm IV}=M\log 2$                            348       1.0

3.1 Single speaker chains

Let us first consider the simplest case, a set of two conversations with a single common speaker. We denote this ‘target speaker’ by $a$, occurring twice, with conversation partners $b$ and $c$, respectively. There are four possible configurations over the two channels $\ell,r$ for the conversations 1 and 2, namely $(ab,ac)$, $(ab,ca)$, $(ba,ac)$ and $(ba,ca)$, cf. Fig. 1. Using $L_{\rm norm}$ to denote the likelihood that all recordings have different identities, the log-likelihood for the first of the four configurations is

$\log L_{1}=\log L_{\rm norm}+\lambda(1\ell,2\ell),$ (3)

with similar likelihoods $L_{2},\ldots,L_{4}$ for the other configurations. Note that with (2), the factor $L_{\rm norm}$ cancels in the posteriors $P_{i}$. A maximum posterior linking decision would therefore correspond to a clustering step based on maximum likelihood.

(Diagram: the two conversations $a\,\&\,b$ and $a\,\&\,c$, each with channels $\ell$ and $r$, connected by the four possible links for $a$.)
Figure 1: The four linking configurations for a single speaker chain in 2 conversations. The link indicates the channel of aa.

Next, we will consider a situation with $M$ conversations, each with the same ‘target speaker’ $a$, and all different conversation partners. Using $L_{\rm norm}^{2M}$ to denote the likelihood that all $2M$ recordings are from different speakers, a first approximation to the likelihood $\tilde{L}_{1}$ for the first configuration $(ab,ac,\ldots,aM)$ could be computed using

$\log\tilde{L}_{1}=\log L_{\rm norm}^{2M}+\sum_{i=1}^{M-1}\lambda(i\ell,(i+1)\ell),$ (4)

i.e., making the “link” between the assumed speaker $a$ in channel $\ell$ from conversation 1 to 2, from 2 to 3, etc. But this is just one possible linking; there are in fact $\binom{M}{2}$ possible log-likelihood-ratios, from which a chain of $M-1$ needs to be chosen. A better approximation therefore is to include them all, and scale them to $M-1$ contributions:

$\log L_{1}=\log L_{\rm norm}^{2M}+\frac{2}{M}\sum_{i<j}\lambda(i\ell,j\ell).$ (5)
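The configuration score of (5) can be sketched as follows (Python; the `llr` dictionary of pairwise scores $\lambda$, keyed by (conversation, channel) pairs with `'l'`/`'r'` standing for $\ell$/$r$, is a hypothetical data layout, and the common offset $\log L_{\rm norm}^{2M}$ is dropped because it cancels in the posterior):

```python
from itertools import combinations, product

def chain_score(llr, config):
    """Relative log-likelihood of one channel configuration for a clique
    with a single target speaker, eq. (5): sum all pairwise LLRs between
    the channels assumed to hold the target, scaled to M - 1 links.
    llr[(i, ci), (j, cj)] (i < j) compares channel ci of conversation i
    with channel cj of conversation j; config is e.g. ('l', 'r', 'l')."""
    M = len(config)
    s = sum(llr[(i, config[i]), (j, config[j])]
            for i, j in combinations(range(M), 2))
    return (2.0 / M) * s

def all_configs(M):
    """The 2^M channel configurations of Section 3."""
    return list(product('lr', repeat=M))
```

With $M=2$ and channel $\ell$ chosen in both conversations, this reduces to the single pairwise score of (3).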

3.2 Re-occurring conversation partners

Thus far we have considered the case where all conversation partners of a particular target $a$ are different. However, it is likely that in some conversations the same partner $b$ occurs. This leads to two different cases:

unresolvable

Target $a$ and $b$ are always paired in the database. In this case, a relative linking is possible, but the absolute attribution of the speakers to the links cannot be resolved.

resolvable

There is more than one conversation partner for $a$ (or $b$). In principle, the speaker linking could be resolved by speaker recognition.

If we further concentrate on the subset of the database involving speaker $a$, then multiple occurrences of $b$ can add to the log-likelihood. The additional contribution of speaker $b$ to the log-likelihood of the first configuration is, similar to (5),

$\frac{2}{M_{b}}\sum_{i<j}\lambda(ir,jr)$ (6)

where the sum is over all $M_{b}$ conversations involving $b$ as a conversation partner of $a$ (for the first configuration, $b$ is in channel $r$). This average log-likelihood-ratio should be added to $L_{1}$ of (5), and similar terms to the log-likelihoods of the other configurations.

3.3 Independent cliques

If there are two sets of speakers ${\cal S}_{1}$ and ${\cal S}_{2}$ in the database that do not share a single conversation, we call these different cliques. When ${\cal C}_{1}$ is a configuration within all conversations $X_{1}$ involving ${\cal S}_{1}$, and ${\cal C}_{2}$ correspondingly for ${\cal S}_{2}$, then the posterior factorizes as

$P({\cal C}_{1},{\cal C}_{2}\mid X_{1},X_{2})=P({\cal C}_{1}\mid X_{1})\,P({\cal C}_{2}\mid X_{2}),$ (7)

i.e., the configurations can be treated independently, reducing the total number of configurations that need to be computed. If the design of the database does not allow for the analysis of separate cliques, the problem of computing the posterior distribution over all configurations of the database will still be intractable for large $M$.
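The computational gain from this factorization is easy to quantify (a small illustration of ours, not from the paper): with cliques of sizes $M_1,\ldots,M_k$, the factored search evaluates $\sum_i 2^{M_i}$ configurations instead of $2^{\sum_i M_i}$.

```python
def n_configs(clique_sizes):
    """Configurations to evaluate: joint search vs. factored over cliques."""
    joint = 2 ** sum(clique_sizes)
    factored = sum(2 ** m for m in clique_sizes)
    return joint, factored

# e.g. ten cliques of 5 conversations each:
# joint search needs 2**50 configurations, factored only 10 * 32 = 320
```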

4 Experiments

In this section we apply the constrained speaker linking approach sketched above to ‘NL/component c’ of the Dutch CGN database [8], which consists of telephone conversations between acquaintances in The Netherlands, recorded using a telephony platform. As mentioned before, this database actually lacks the channel-to-speaker assignment, but within the context of the N-Best project [9] this information had been manually added. We treat this part of CGN as a prototypical example of the channel assignment task, and use the manually added reference for evaluation purposes. It consists of 352 conversations (704 sides) involving 357 different speakers.

The speaker recognition system we use is a lightweight ‘UBM/GMM dot-scoring’ system [13], implemented entirely in Julia (http://julialang.org/), the new high performance language for numerical computing. We use a standard acoustical front end based on 20 MFCC’s plus first and second derivatives, energy-based Speech Activity Detection, and 4 second feature warping [14]. The 1024 component UBM was trained gender-independently on ‘NL/component d’ (telephone conversations recorded using a mini-disc, 600 conversations, 176 speakers, 21 hours). The speaker comparison score is a linear approximation to the MAP-adapted [15, 16] GMM/UBM log-likelihood-ratio score. For probabilistically interpretable likelihood ratios (1) we self-calibrated the collection of test scores using CMLG [17] before applying (2); this procedure is comparable to determining $C_{\rm det}^{\rm min}$ in speaker detection evaluation. The equal error rate of the system, evaluated on a full score matrix of all recordings of the test data, is $E_{=}=7.0\,\%$.

NL/component c of CGN consists of many independent cliques, presumably as a result of the way the speakers were recruited. There are 125 cliques of 2–5 speakers, each having 1–6 conversations amongst each other. During the course of this study, the recognition system hinted at potential labelling errors. We manually listened to the conversations of cliques to check the labelling. We started with the largest cliques of 6 conversations, selecting only those with the biggest log error, and worked our way down in clique size. In most cases labelling errors were very clear, observable from gender, mentioned names and relations, as many cliques seemed to have been formed within families. We corrected the labelling of suspicious cliques once, without further feedback from the speaker recognition and linking system. As a result of various dubious speaker labels, we ended up using 348 conversations involving 356 speakers.

4.1 Evaluation metrics

A natural metric that extends the earlier mentioned entropy considerations is the cross entropy. The cross entropy between the evaluator’s and recognizer’s posteriors for a clique ${\cal S}$ with a true partitioning ${\cal P}$ is

$H_{\rm cross}=-\sum_{{\cal C}\in{\cal S}}P({\cal C}\mid{\cal P})\log P({\cal C}\mid X)$ (8)
$\phantom{H_{\rm cross}}=-\log P({\cal P}\mid X),$ (9)

i.e., $H_{\rm cross}$ can be seen as a logarithmic scoring rule, which is known to be strictly proper [18]. Because $H_{\rm cross}$ covers $M_{\cal S}$ conversations, the average cross entropy per conversation is

$\overline{H}_{\rm cross}=\frac{H_{\rm cross}}{M_{\cal S}}.$ (10)

An intuitive meaning of this average entropy is through the perplexity, $\exp\overline{H}_{\rm cross}$, as the average number of configurations to choose from for a set of conversations. Along the lines of the Albayzin Language Recognition Evaluation metric [19], we will rather use the confusion

$F=e^{\overline{H}_{\rm cross}}-1.$ (11)

It represents the average number of wrong alternatives; $F$ is 0 for a system operating perfectly. In Table 1 we have tabulated the reduction of $F$ based on the prior entropy. In the last row, which represents our channel assignment task, there is just one configuration alternative to the correct one. The goal is to further reduce this to 0 by applying speaker linking.
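The metric chain (9)–(11) is short enough to sketch directly (Python, natural logarithms; `confusion` is our name). Note that a posterior of 0.5 on the truth for a single conversation gives exactly $F=1$:

```python
from math import exp, log

def confusion(p_truth, n_conversations):
    """F = exp(Hbar_cross) - 1, eqs. (9)-(11): the average number of
    wrong configuration alternatives per conversation."""
    h_cross = -log(p_truth)             # eq. (9), in nats
    h_bar = h_cross / n_conversations   # eq. (10)
    return exp(h_bar) - 1.0             # eq. (11)
```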

Finally, we define the clique error rate $E$ as the fraction of cliques for which the maximum posterior configuration is not the true speaker configuration.

4.2 Linking

In Table 2 the results of the speaker attribution experiment for CGN are shown. We have conditioned the performance on the complexity $|{\cal C}|=2^{M_{\cal S}}$, the number of configurations in the speaker attribution task, thereby averaging over cliques with the same number of configurations. For a clique of one conversation, $|{\cal C}|=2$, speaker linking cannot resolve the uncertainty, $P_{1,2}=0.5$, and the entropy is 1 bit. There are 8 such speaker pairs that both occur only once in the database, and a further four cases with $|{\cal C}|=4$ where the same speakers are paired twice. We will further discard these unresolvable cases in the analysis.

Table 2: Performance of the linking experiment on CGN, Dutch, component c, grouped by complexity. $|{\cal C}|$ is the number of configurations for a clique, $N_{\cal C}$ is the number of cliques of this complexity. Entropies are measured in bits.

$|{\cal C}|=2^{M_{\cal S}}$   $N_{\cal C}$   $\overline{H}_{\rm cross}$   $F$     $E$ (%)
2                             8              1.0                          1.0     NA
4                             61             0.205                        0.153   3.3
8                             13             0.201                        0.150   7.7
16                            26             0.032                        0.023   7.7
32                            9              0.041                        0.029   22
64                            5              0.019                        0.013   20
all resolvable                110            0.078                        0.056   7.0

The average cross entropy drops below 1 bit for all complexities above 2, an increasing reduction w.r.t. the original entropy $\overline{H}=1$. The confusion drops steadily with increasing complexity, and finally, the clique error rate is more or less stable at about 7 % on average. We have computed $N_{\cal C}$-weighted averages over the resolvable cliques in the last row of the table.

4.3 Calibration

The scores $\lambda$ have been ‘self-calibrated’ using a linear transform $\lambda=as+b$, where $s$ is the score and $a$ and $b$ are parameters found to minimize the 2-class cross entropy (e.g., $C_{\rm llr}$) in a classical detection set-up, quite similar to logistic regression. For the posterior (2) the offset $b$ cancels, so effectively $a$ is the only parameter of importance in this task.

We want to investigate whether the approximation by averaging of log-likelihoods as in (5), and the addition of independent evidence as in (6), make the system over-confident. Therefore we re-calibrate the scaling factor $a$ for different clique complexities, and investigate the change in $a$ and the improvement in average cross entropy per conversation.
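This re-calibration can be sketched as a one-dimensional search (Python; the clique data layout and function names are our assumptions, not the paper's implementation): scale all configuration log-likelihoods by a candidate factor, recompute the posterior of the true configuration, and keep the factor that minimizes the average cross entropy.

```python
from math import exp, log

def avg_cross_entropy(scale, cliques):
    """Average cross entropy per conversation (nats) after scaling the
    configuration log-likelihoods; cliques is a list of triples
    (log_liks, true_index, n_conversations)."""
    total_h, total_m = 0.0, 0
    for log_liks, true_idx, m in cliques:
        scaled = [scale * x for x in log_liks]
        mx = max(scaled)
        log_z = mx + log(sum(exp(x - mx) for x in scaled))
        total_h += -(scaled[true_idx] - log_z)  # -log P(truth | X), eq. (9)
        total_m += m
    return total_h / total_m

def best_scale(cliques, grid):
    """Grid search for the scale minimizing the average cross entropy."""
    return min(grid, key=lambda a: avg_cross_entropy(a, cliques))
```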

Table 3: The effect of calibration. $\overline{H}_{\rm cross}$ is the average cross entropy before re-calibration, cf. Table 2. The data are taken only over resolvable cliques.

$M_{\cal S}$   $\overline{H}_{\rm cross}$   $\overline{H}_{\rm cross}^{\rm min}$   $a_{\rm min}/a$
2              0.082                        0.071                                  1.48
3              0.201                        0.177                                  0.65
4              0.032                        0.029                                  1.42
5              0.041                        0.040                                  0.77
6              0.019                        0.018                                  0.79
all            0.078                        0.076                                  1.17

In Table 3 the improved $\overline{H}_{\rm cross}^{\rm min}$ for the different complexities is shown, with the additional factor $a_{\rm min}/a$ applied to the recognizer’s log-likelihoods. These values of $a_{\rm min}/a$ vary a bit, but given the relatively small numbers of cliques involved in the optimization, this may be expected. The optimization over all resolvable cliques results in a value $a_{\rm min}/a=1.17$, which is close to unity. This means that the original calibration, based on detection trials only, was good, and that the averaging operation (5) and the addition of evidence from other links in the clique (6) do not make the posteriors overconfident.

5 Conclusions

We have seen in the channel assignment task that with higher complexity, i.e., higher prior entropy, the average cross entropy per conversation quickly drops towards zero. On the one hand, with increasing complexity the task gets harder, because there are more configurations to discriminate between. On the other hand, the increased clique size provides more opportunity for the recognition system to determine the correct configuration with confidence. These two effects more or less cancel in the fraction of correctly found configurations. With a very lightweight speaker recognizer, trained on a small amount of domain-specific speech material, the channel assignment task can be carried out with a high degree of success. This is indeed a complete Bayesian solution to the speaker partitioning problem as proposed in [2], which is deemed intractable in general. In cases like the CGN database, where conversation clique sizes $M_{\cal S}$ are relatively small, the computational load is still negligible (computing the posteriors for all cliques, given all relevant log-likelihood-ratios, takes about 1 second), but of course the load grows exponentially with $M_{\cal S}$. However, cliques in forensic investigations are likely to be small as well, so we expect that our approach can be used there too.

References

  • [1] D. A. van Leeuwen, “Speaker linking in large data sets,” in Proc. Odyssey. IEEE, June 2010, pp. 202–208.
  • [2] N. Brümmer and E. de Villiers, “The speaker partitioning problem,” in Proc. Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, June 2010.
  • [3] S. Tranter and D. A. Reynolds, “An overview of automatic speaker diarization systems,” IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1557–1565, 2006.
  • [4] M. Huijbregts, D. van Leeuwen, and F. de Jong, “Speech overlap detection in a two-pass speaker diarization system,” in Proc. Interspeech. Brighton: ISCA, September 2009, pp. 1063–1066.
  • [5] M. Huijbregts and D. A. van Leeuwen, “Large scale speaker diarization for long recordings and small collections,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 404–413, 2012.
  • [6] M. Ferras and H. Bourlard, “Speaker diarization and linking of large corpora,” in Proc. Spoken Language Technology Workshop, Miami, FL, 2012, pp. 280–285.
  • [7] W. M. Campbell and E. Singer, “Query-by-example using speaker content graphs,” in Proc. Interspeech, Portland, 2012, pp. 1095–1098.
  • [8] N. H. J. Oostdijk and D. Broeder, “The Spoken Dutch Corpus and its exploitation environment,” in Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03), Budapest, Hungary, 2003.
  • [9] D. A. van Leeuwen, J. Kessens, E. Sanders, and H. van den Heuvel, “Results of the N-Best 2008 Dutch speech recognition evaluation,” in Proc. Interspeech. Brighton: ISCA, September 2009, pp. 2571–2574.
  • [10] J. Despres, P. Fousek, J.-L. Gauvain, S. Gay, Y. Josse, L. Lamel, and A. Messaoudi, “Modeling Northern and Southern varieties of Dutch for STT,” in Proc. Interspeech. Brighton: ISCA, 2009, pp. 96–99.
  • [11] M. Huijbregts, R. Ordelman, L. Werff, and F. Jong, “SHoUT, the University of Twente submission to the N-Best 2008 speech recognition evaluation for Dutch,” in Proc. Interspeech. ISCA, 2009, pp. 2575–2578.
  • [12] M. Huijbregts and D. A. van Leeuwen, “Diarization-based speaker retrieval for broadcast television archives,” in Proc. Interspeech. Firenze: ISCA, August 2011.
  • [13] A. Strasheim and N. Brümmer, “SUNSDV system description: NIST SRE 2008,” NIST SRE workshop proceedings, May 2008.
  • [14] J. Pelecanos and S. Sridharan, “Feature warping for robust speaker verification,” in Proc. Speaker Odyssey. Crete, Greece, 2001, pp. 213–218.
  • [15] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains,” IEEE Trans. Speech Audio Processing, vol. 2, pp. 291–298, 1994.
  • [16] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, 2000.
  • [17] D. A. van Leeuwen and N. Brümmer, “The distribution of calibrated likelihood-ratios in speaker recognition,” in Proc. Interspeech. ISCA, 2013, pp. 1619–1623.
  • [18] M. DeGroot and S. Fienberg, “The comparison and evaluation of forecasters,” The Statistician, pp. 12–22, 1983.
  • [19] L. J. Rodríguez-Fuentes, N. Brümmer, M. Penagarikano, A. Varona, G. Bordel, and M. Diez, “The Albayzin 2012 language recognition evaluation,” in Proc. Interspeech, 2013, pp. 1497–1501.