
SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation

Jiayu Du, Jinpeng Li, Guoguo Chen, and Wei-Qiang Zhang
J. Du and J. Li contributed equally to this work and should be considered co-first authors.
Jiayu Du, SpeechColab, China (email: jerry.jiayu.du@gmail.com).
Jinpeng Li, Department of Electrical Engineering, Tsinghua University, Beijing 100084, China (email: lijp22@mails.tsinghua.edu.cn).
Guoguo Chen, Seasalt AI Inc., USA; SpeechColab, China (email: chenguoguo06@gmail.com).
Wei-Qiang Zhang, Department of Electrical Engineering, Tsinghua University, Beijing 100084, China; SpeechColab, China (email: wqzhang@tsinghua.edu.cn).
This work was supported in part by the National Natural Science Foundation of China under Grant 62276153.
Abstract

In the wake of the surging tide of deep learning over the past decade, Automatic Speech Recognition (ASR) has garnered substantial attention, leading to the emergence of numerous publicly accessible ASR systems that are actively being integrated into our daily lives. Nonetheless, the impartial and replicable evaluation of these ASR systems encounters challenges due to various crucial subtleties. In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. With this platform: (i) We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems, covering both open-source models and industrial commercial services. (ii) We quantify how distinct nuances in the scoring pipeline influence the final benchmark outcomes. These include nuances related to capitalization, punctuation, interjections, contractions, synonym usage, compound words, etc. These issues have gained prominence in the context of the transition towards an End-to-End future. (iii) We propose a practical modification to the conventional Token-Error-Rate (TER) evaluation metric, with inspirations from Kolmogorov complexity and Normalized Information Distance (NID). This adaptation, called modified-TER (mTER), achieves proper normalization and symmetrical treatment of reference and hypothesis. By leveraging this platform as a large-scale testing ground, this study demonstrates the robustness and backward compatibility of mTER when compared to TER. The SpeechColab Leaderboard is accessible at https://github.com/SpeechColab/Leaderboard.

Index Terms:
Automatic Speech Recognition (ASR), Benchmark, Evaluation Metrics, Word Error Rate, Kolmogorov Complexity

I Introduction

Automatic Speech Recognition (ASR) has been an active research topic for many years. Traditional ASR combines Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) to capture the dynamics of the speech signal and the hierarchical knowledge behind human languages [1]. In recent years, deep neural networks (DNNs) have emerged with superior accuracy [2], and have quickly become the mainstream for ASR. For instance, the chain model [3] incorporates Convolutional Neural Networks (CNNs) and Time Delay Neural Networks (TDNNs), while the DeepSpeech [4] model utilizes Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs). Modern systems are leaning toward even more sophisticated architectures such as the Transformer [5] and the Conformer [6], coupled with sequence losses like Connectionist Temporal Classification (CTC) [7] and the Recurrent Neural Network Transducer (RNN-T) [8]. From a system perspective, driven by the scaling law from language modeling research, large speech models such as OpenAI-Whisper [9] and Google-USM [10] have been developed, pushing up the scale of ASR training by orders of magnitude. In the meantime, self-supervised training, as a paradigm shift, is gaining popularity as a way to leverage the abundant unlabeled data in the world. Notable examples are wav2vec 2.0 [11], HuBERT [12], WavLM [13], and data2vec [14].

Given the swift evolution of ASR technology, a variety of speech toolkits have been developed and open-sourced, such as HTK [15], Kaldi [16], ESPnet [17], NeMo [18], SpeechBrain [19], WeNet [20], and K2 (https://github.com/k2-fsa), offering comprehensive libraries and recipes to facilitate ASR research and development. However, the evaluation of ASR still remains challenging [21, 22], because there exist various crucial subtleties and pitfalls that require non-trivial efforts to do right in practice, such as text normalization [23]. The divergent ecosystem struggles to reach a clear and consistent understanding of the performance of modern ASR systems.

To address the problem, we present SpeechColab Leaderboard, an open-source benchmark platform, so that speech researchers and developers can reliably reproduce, examine, and compare all kinds of ASR systems. The platform is designed to be: (i) Simple: consistent data formats and unified interfaces minimize accidental complexity. (ii) Open: leaderboard users should be able to easily share and exchange resources (e.g. test sets, models, configurations). (iii) Reproducible: ASR systems, including all their dependencies and environment details, should be reproducible as a whole.

In Section 2, we describe the proposed platform, including three major components: a dataset zoo, a model zoo, and an evaluation pipeline. In Section 3, we report a large-scale benchmark for English ASR on the platform. In Section 4, the traditional evaluation metric TER (Token Error Rate) is briefly revisited, and a simple and practical modification is proposed to make TER more robust.

Figure 1: SpeechColab Leaderboard
TABLE I: Dataset Zoo
Dataset Number of sentences Total duration Source Style Release Date
LibriSpeech.test-clean [24] 2620 5.403 hours the LibriVox project narrated audio-books 2015
LibriSpeech.test-other [24] 2939 5.342 hours the LibriVox project narrated audio-books 2015
TEDLIUM3.dev [25] 507 1.598 hours TED talks oratory 2018
TEDLIUM3.test [25] 1155 2.617 hours TED talks oratory 2018
GigaSpeech.dev [26] 5715 11.366 hours Podcast and YouTube spontaneous 2021
GigaSpeech.test [26] 19930 35.358 hours Podcast and YouTube spontaneous 2021
VoxPopuli.dev [27] 1753 4.946 hours European Parliament spontaneous 2021
VoxPopuli.test [27] 1841 4.864 hours European Parliament spontaneous 2021
VoxPopuli.test-accented [27] 8357 26.174 hours European Parliament spontaneous 2022
CommonVoice11.0.dev [28] 16352 27.245 hours crowd sourcing narrated prompts 2022
CommonVoice11.0.test [28] 16351 26.950 hours crowd sourcing narrated prompts 2022
TABLE II: Model Zoo
Type Model Architecture Size in bytes Date of Evaluation
Commercial API aliyun_api_en 2022.10
amazon_api_en 2022.10
baidu_api_en 2022.10
google_api_en 2022.10
google_USM_en [10] 2023.03
microsoft_sdk_en 2022.10
tencent_api_en 2022.10
Open-Source (Supervised) vosk_model_en_large Kaldi chain model 2.7G 2022.10
deepspeech_model_en [4] RNN + N-gram 1.1G 2022.10
coqui_model_en RNN + N-gram 979M 2022.10
nemo_conformer_ctc_large_en [18] Conformer-CTC 465M 2022.10
nemo_conformer_transducer_xlarge_en [18] Conformer-Transducer 2.5G 2022.10
k2_gigaspeech [29] Pruned stateless RNN-T 320M 2022.10
whisper_large_v1 [9] Transformer Encoder-Decoder 2.9G 2022.10
whisper_large_v2 [9] Transformer Encoder-Decoder 2.9G 2023.03
Open-Source (Unsupervised + Fine-tuned) data2vec_audio_large_ft_libri_960h data2vec [14] 1.2G 2022.10
hubert_xlarge_ft_libri_960h HuBERT [12] 3.6G 2022.10
wav2vec2_large_robust_ft_libri_960h wav2vec 2.0 [11] 2.4G 2022.10
wavlm_base_plus_ft_libri_clean_100h WavLM [13] 361M 2022.10

* Some models do not provide interfaces to query the total number of parameters, so we list model sizes in bytes in the table for consistency.

II The Platform

II-A Overview

The overall platform is shown in Figure 1. We implement the dataset zoo and the model zoo on top of commercial cloud-storage services, so that the platform can serve as a reliable and high-speed disk for our users to exchange data and models. Evaluation sets and models are associated with globally unique identifiers. Within our repository, to initiate a benchmark:

ops/benchmark -m <MODEL_ID> -d <DATASET_ID>

II-B Dataset Zoo

In the dataset zoo: (i) All utterances, if necessary, are extracted from raw long audio and stored consistently in short WAV format (utterances longer than 60s are removed). (ii) The metadata is organized in tab-separated-values (.tsv) format with 4 columns, as shown in Table III.

TABLE III: Unified metadata.tsv
ID AUDIO DURATION TEXT
POD0000051 audio/POD0000051.wav 2.100 But what kind of business?
POD0000094 audio/POD0000094.wav 2.727 So we’re gonna make it …
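For illustration, a minimal Python sketch of how such a metadata.tsv could be consumed is given below (this is our own example, not the platform's actual loader; the dataset path is a placeholder):

    import csv

    # Minimal sketch: iterate over a dataset's unified metadata.tsv (see Table III).
    # "SOME_DATASET/metadata.tsv" is a placeholder path, not a real identifier.
    with open("SOME_DATASET/metadata.tsv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")  # columns: ID, AUDIO, DURATION, TEXT
        for row in reader:
            utt_id = row["ID"]
            wav_path = row["AUDIO"]            # short WAV file, relative to the dataset root
            duration = float(row["DURATION"])  # duration in seconds
            text = row["TEXT"]                 # reference transcript
            # ... feed (utt_id, wav_path) to an ASR system; keep text as the reference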

As of the writing of this paper, 11 well-known ASR evaluation sets are processed and integrated into the dataset zoo, summarized in Table I. We provide simple operational utilities for dataset management. For example, to share a local test set to the zoo:

    ops/push -d <DATASET_ID>

And to retrieve a test set from the zoo:

    ops/pull -d <DATASET_ID>

II-C Model Zoo

We define a simple and unified interface for the model zoo: all models should be built within a Docker container along with an ASR program that takes a list of WAV files as input. By design, we support both open-source models and commercial API services. Similar to the dataset zoo, leaderboard users can easily publish or reproduce ASR systems by using:

    ops/push -m <MODEL_ID>
    ops/pull -m <MODEL_ID>
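As a rough sketch of what the per-model ASR entry point inside such a container might look like (the file names, argument order, and output format below are our illustrative assumptions, not a documented interface):

    #!/usr/bin/env python3
    # Illustrative model-zoo entry point: read a list of WAV files, write one
    # "utt_id <TAB> hypothesis" line per utterance.
    import sys

    def recognize(wav_path: str) -> str:
        """Placeholder: run the wrapped model (or call a commercial API) on one WAV file."""
        raise NotImplementedError

    def main(wav_list_path: str, output_path: str) -> None:
        with open(wav_list_path, encoding="utf-8") as fin, \
             open(output_path, "w", encoding="utf-8") as fout:
            for line in fin:
                utt_id, wav_path = line.strip().split("\t", maxsplit=1)
                fout.write(f"{utt_id}\t{recognize(wav_path)}\n")

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])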

Table II provides a detailed list of 19 integrated models.

II-D Evaluation Pipeline

TABLE IV: Text Normalization Components
Pipeline components Description Raw text New text
CASE unify cases And then there was Broad Street. AND THEN THERE WAS BROAD STREET.
PUNC remove punctuation (, . ? ! " - and single quote) "'He doesn't say exactly what it is,' said Ruth, a little dubiously." He doesnt say exactly what it is said Ruth a little dubiously
ITJ remove interjections uh yeah um that’s good yeah that’s good
UK-US unify UK & US spelling conventions 1. she went to the theatre 2. such a humour 3. I apologise 1. she went to the theater 2. such a humor 3. I apologize
NSW normalize Non-Standard-Word (number, quantity, date, time, address etc) 1. gave him $100. 2. Just before 8.30 a.m. 3. grew up in the 1980s 4. the baggage is 12.7kg 5. in the 21st century 6. 1/3 of the population 7. 13,000 people 8. 1998/2/30 1. gave him one hundred dollars. 2. Just before eight thirty AM 3. grew up in the nineteen eighties 4. the baggage is twelve point seven kilograms 5. in the twenty first century 6. one third of the population 7. thirteen thousand people 8. february thirtieth nineteen ninety eight

II-D1 Preprocessing

In practice, references and hypotheses usually come in different forms, so pre-processing is needed to remove the discrepancy. As listed in Table IV, we deal with case, punctuation, common interjections, US/UK spelling conventions (see http://www.tysto.com/uk-us-spelling-list.html), etc. For Non-Standard-Word (NSW) normalization, we leverage the context-dependent rewriting rules from the NeMo toolkit [30]. We currently do not have a reliable solution for stutter and disfluency detection; this is left as future work.
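A minimal sketch of the CASE / PUNC / ITJ / UK-US steps is shown below; the word lists and regular expression are simplified assumptions for illustration, and the real pipeline (e.g. the NSW rewriting grammars from NeMo) is considerably richer:

    import re

    # Simplified illustrations of Table IV's components (not the production rules).
    INTERJECTIONS = {"uh", "um", "eh", "hmm"}          # cf. utils/interjections_en.csv
    UK_TO_US = {"theatre": "theater", "humour": "humor", "apologise": "apologize"}

    def normalize(text: str) -> str:
        text = text.lower()                    # CASE: unify case (lower-casing here; Table IV shows upper-casing)
        text = text.replace("'", "")           # PUNC: drop single quotes / apostrophes
        text = re.sub(r'[,.?!"\-]', " ", text) # PUNC: drop remaining punctuation
        tokens = [t for t in text.split() if t not in INTERJECTIONS]  # ITJ
        tokens = [UK_TO_US.get(t, t) for t in tokens]                 # UK-US
        return " ".join(tokens)

    print(normalize("Uh, she went to the theatre!"))  # -> "she went to the theater"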

TABLE V: Dynamic Alternative Expansion (DAE) (hypothesis only)
Pipeline components Description Alternative Set Raw hyp hyp after DAE
DAE contractions, abbreviations, compounds, etc. We're = We are, I'm = I am, gonna = going to, OK = O K = Okay, storyteller = story-teller = story teller 1. We're here early 2. I'm gonna be OK 3. He is an excellent storyteller 1. (We're|We are) here early 2. (I'm|I am) (gonna|going to) be (OK|O K|Okay) 3. He is an excellent (storyteller|story teller|story-teller)

II-D2 Metric

Token Error Rate is used to evaluate the accuracy of ASR systems:

{\rm TER}(ref,hyp) \doteq \frac{\mathcal{LD}(ref \to hyp)}{|ref|}    (1)

where \mathcal{LD} denotes the Levenshtein distance, and |ref| refers to the number of words (for English ASR) in the reference.

II-D3 Scoring

The scoring module is implemented in the Weighted Finite State Transducer (WFST) framework, in particular via OpenFST [31] and Pynini [32]. The pre-processed references and hypotheses are first tokenized into word sequences, then transformed into linear FSTs. The Levenshtein transducer \mathcal{L} is constructed with the standard operations and costs: insertion (INS: 1.0), deletion (DEL: 1.0), substitution (SUB: 1.0), and correct-match (COR: 0.0). The Levenshtein distance in the numerator can then be calculated as:

\mathcal{LD}(ref \to hyp) = {\rm cost\_of\_shortest\_path}(ref \circ \mathcal{L} \circ hyp)

where \circ denotes FST composition. Note that in large vocabulary continuous speech recognition (LVCSR), the static size of \mathcal{L} could explode because substitution requires O(V^{2}) space, where V refers to the vocabulary size. We follow the optimization practice in [33] by leveraging auxiliary symbols, so that \mathcal{L} is factored as the product of two smaller FSTs.
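For readers less familiar with WFSTs, the quantity computed by this composition is simply the token-level Levenshtein distance; a small plain-Python reference sketch (conceptually equivalent to, but not, the platform's WFST scorer) is:

    def levenshtein(ref: list[str], hyp: list[str]) -> int:
        """Token-level Levenshtein distance with unit INS/DEL/SUB costs."""
        # dp[j] holds the edit distance between ref[:i] and hyp[:j] for the current i.
        dp = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, start=1):
            prev_diag, dp[0] = dp[0], i
            for j, h in enumerate(hyp, start=1):
                prev_diag, dp[j] = dp[j], min(
                    dp[j] + 1,                        # DEL: ref token missing from hyp
                    dp[j - 1] + 1,                    # INS: extra hyp token
                    prev_diag + (0 if r == h else 1)  # COR / SUB
                )
        return dp[-1]

    def ter(ref: list[str], hyp: list[str]) -> float:
        return 100.0 * levenshtein(ref, hyp) / len(ref)

    print(ter("the cat sat".split(), "the cat sat down here".split()))  # 2 INS / 3 ref words, about 66.7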

II-D4 Dynamic Alternative Expansion (DAE)

Similar to NIST's GLM mechanism (https://github.com/usnistgov/SCTK/blob/master/doc/GLMRules.txt), we enhance the standard Levenshtein distance algorithm to support customizable alternative sets, dealing with synonyms, contractions, abbreviations, compound words, etc. on-the-fly.

As illustrated in Figure 2, hyp is transformed from the original linear structure into a sausage-like FST, so that no matter which alternative appears in ref, the expanded hyp can correctly match up with it. Note that we mark the expanded alternative paths with auxiliary hash tags, and the Levenshtein transducer \mathcal{L} is modified accordingly to disallow partial matches of these expanded segments. Also note that dynamic alternative expansion (DAE) is applied only to hyp, whereas ref remains intact.

Figure 2: Levenshtein distance with Dynamic Alternative Expansion (DAE)
TABLE VI: Benchmark results: WER (in %) with rank in parentheses
Model  LibriSpeech.test-clean  LibriSpeech.test-other  TEDLIUM3.dev  TEDLIUM3.test  GigaSpeech.dev  GigaSpeech.test  VoxPopuli.dev  VoxPopuli.test  VoxPopuli.test-accented  CommonVoice11.0.dev  CommonVoice11.0.test
aliyun_api_en 4.34 (12) 10.01 (12) 5.79 (12) 5.53 (12) 12.63 (11) 12.83 (11) 11.93 (9) 11.50 (9) 15.61 (5) 15.67 (9) 17.86 (9)
amazon_api_en 6.42 (16) 13.20 (15) 5.11 (8) 4.74 (9) 11.31 (8) 11.71 (8) 12.14 (11) 11.84 (11) 14.56 (4) 22.66 (15) 26.25 (16)
baidu_api_en 6.61 (17) 14.69 (16) 8.65 (14) 7.93 (14) 16.94 (13) 16.80 (13) 14.55 (15) 14.08 (15) 19.98 (11) 22.74 (16) 25.97 (15)
google_api_en 5.63 (15) 12.56 (14) 5.54 (10) 5.26 (10) 11.70 (9) 11.78 (9) 11.95 (10) 11.77 (10) 15.72 (6) 17.49 (10) 20.60 (10)
google_USM_en 2.13 (5) 4.35 (5) 3.69 (3) 3.05 (1) 8.67 (2) 9.23 (3) 9.89 (8) 9.50 (8) 14.18 (3) 8.69 (5) 10.19 (4)
microsoft_sdk_en 3.03 (10) 6.65 (10) 3.65 (2) 3.14 (2) 8.74 (3) 9.02 (1) 9.04 (5) 8.88 (5) 12.17 (1) 8.86 (6) 10.46 (5)
tencent_api_en 3.54 (11) 7.26 (11) 4.94 (7) 4.20 (5) 10.19 (6) 10.37 (6) 9.21 (6) 8.90 (6) 15.81 (7) 11.59 (7) 12.91 (7)
vosk_model_en_large 5.17 (14) 12.36 (13) 5.74 (11) 5.41 (11) 13.84 (12) 14.08 (12) 13.22 (12) 12.52 (12) 20.27 (12) 21.66 (14) 25.46 (14)
deepspeech_model_en 6.95 (18) 21.23 (19) 17.71 (18) 18.31 (18) 32.68 (17) 31.21 (17) 28.46 (19) 29.10 (19) 38.23 (19) 43.34 (19) 47.80 (19)
coqui_model_en 4.94 (13) 15.92 (18) 17.48 (17) 18.66 (19) 34.62 (18) 32.45 (18) 27.60 (18) 27.66 (18) 37.85 (18) 35.52 (17) 40.10 (17)
nemo_conformer_ctc_large_en 1.95 (4) 4.26 (4) 5.11 (8) 4.69 (8) 11.70 (9) 12.01 (10) 6.68 (2) 6.61 (2) 18.24 (8) 8.44 (3) 9.23 (2)
nemo_conformer_transducer_xlarge_en 1.36 (1) 2.76 (1) 4.61 (5) 4.31 (6) 10.85 (7) 11.53 (7) 6.05 (1) 5.97 (1) 20.80 (14) 5.20 (1) 5.82 (1)
k2_gigaspeech 2.18 (7) 5.30 (8) 3.42 (1) 3.40 (3) 8.90 (4) 9.71 (5) 9.80 (7) 9.48 (7) 12.46 (2) 14.05 (8) 17.04 (8)
whisper_large_v1 2.66 (9) 5.81 (9) 4.90 (6) 4.31 (6) 9.11 (5) 9.68 (4) 8.31 (4) 7.88 (4) 20.80 (14) 8.68 (4) 10.89 (6)
whisper_large_v2 2.14 (6) 4.65 (6) 3.70 (4) 3.55 (4) 8.53 (1) 9.19 (2) 7.82 (3) 7.17 (3) 18.83 (9) 7.88 (2) 10.01 (3)
data2vec_audio_large_ft_libri_960h 1.64 (2) 3.85 (3) 8.31 (13) 7.66 (13) 18.43 (14) 17.92 (14) 13.73 (13) 13.42 (13) 19.45 (10) 18.25 (12) 21.75 (11)
hubert_xlarge_ft_libri_960h 1.79 (3) 3.48 (2) 8.94 (15) 8.14 (15) 18.87 (15) 18.43 (15) 14.47 (14) 13.62 (14) 20.58 (13) 18.05 (11) 21.79 (12)
wav2vec2_large_robust_ft_libri_960h 2.46 (8) 5.18 (7) 9.66 (16) 9.02 (16) 19.80 (16) 18.81 (16) 14.99 (16) 14.76 (16) 21.00 (16) 18.93 (13) 22.35 (13)
wavlm_base_plus_ft_libri_clean_100h 7.00 (19) 15.68 (17) 18.75 (19) 18.21 (17) 34.78 (19) 33.02 (19) 21.55 (17) 21.16 (17) 27.94 (17) 40.17 (18) 44.96 (18)

There are two reasons for performing dynamic alternative expansion instead of handling these cases directly in the text normalization stage. Firstly, specifying a canonical form among these alternatives is controversial, and as platform developers we are more interested in providing DAE as a flexible mechanism than in ironing out specific TN (Text Normalization) policies. Secondly, with hyp-only expansion, the algorithm always honors the raw form of the human labels (i.e. ref), so that alternative sets can evolve without mutating the denominator of TER (i.e. |ref|), which is a desirable property for comparisons over time.
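To make the DAE semantics concrete, the brute-force sketch below (our illustration, not the platform's FST implementation) enumerates every expansion of hyp and keeps the minimum distance, reusing the levenshtein function from the earlier sketch; the sausage FST of Figure 2 computes the same quantity without the exponential enumeration:

    from itertools import product

    # Toy alternative sets keyed by a single hypothesis token; this is a simplification,
    # since the real DAE also handles multi-token keys (e.g. "story teller" = "storyteller").
    ALTERNATIVES = {
        "we're": [("we're",), ("we", "are")],
        "gonna": [("gonna",), ("going", "to")],
        "ok":    [("ok",), ("o", "k"), ("okay",)],
    }

    def dae_min_distance(ref: list[str], hyp: list[str]) -> int:
        """Minimum Levenshtein distance over all alternative expansions of hyp."""
        options = [ALTERNATIVES.get(tok, [(tok,)]) for tok in hyp]
        best = None
        for choice in product(*options):                   # enumerate the "sausage" paths
            expanded = [t for alt in choice for t in alt]  # flatten the chosen alternatives
            d = levenshtein(ref, expanded)
            best = d if best is None else min(best, d)
        return best

    print(dae_min_distance("we are going to be okay".split(), "we're gonna be ok".split()))  # -> 0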

III The Benchmark

III-A Results

Table VI shows the Word Error Rates (WER) and the rankings for all models and test sets on the platform:

1) Taking DeepSpeech, the SOTA back in 2014, as a reference point, the OpenAI-Whisper model has achieved remarkable relative WER reductions: 78% on LibriSpeech-test-other, 81% on TEDLIUM3-test, 71% on GigaSpeech-test, 75% on VoxPopuli-test, 50% on VoxPopuli-accented, and 79% on CommonVoice-test. These numbers reflect the overall progress of ASR over the last decade.

2) Open-source models tend to beat commercial services on LibriSpeech by a large margin, which reveals an odd situation around the highly influential LibriSpeech dataset: (i) The LibriSpeech benchmark may not be a good indicator for real-life ASR, so commercial services devote little effort to it in their production systems. (ii) On the other hand, open-source models apparently prioritize LibriSpeech since it is the de facto standard in ASR research, but these systems often generalize poorly to other tasks; the effort spent pushing the LibriSpeech SOTA may turn out to be an overfitting game. (iii) Similarly, even strong self-supervised pretrained models (such as data2vec, HuBERT, wav2vec 2.0 and WavLM) fail to show strong generalization ability when fine-tuned on LibriSpeech only.

3) As another manifestation of the scaling law, large speech models, such as Whisper, USM, and the xlarge NeMo transducer, prove to be highly competitive even compared with the best commercial services. More importantly, some of these large ASR systems are open-sourced. Therefore, in the foreseeable future, as these large models get better and hardware gets cheaper, some traditional ASR providers will need to find compelling reasons for customers to keep paying.

III-B Scoring subtleties

There are various subtleties in the scoring pipeline that can affect benchmark results, such as punctuation, interjections, non-standard word normalization, US/UK spellings, etc. To study the importance of these subtleties, an ablation experiment is conducted by turning off each component in turn, as shown in columns A1-A5 of Table VII. Note that we use the GigaSpeech test set for this study rather than LibriSpeech, since GigaSpeech is curated mainly from real-world scenarios (YouTube and podcasts) covering daily topics in spontaneous conversations.

TABLE VII: Effects of turning off a single component (on GigaSpeech.test); cells show WER (%) with rank in parentheses
A0 is the baseline with all pipeline components (CASE, PUNC, ITJ, UK-US, NSW, DAE) applied; each of A1-A5 turns off one component: A1 = PUNC, A2 = ITJ, A3 = UK-US, A4 = NSW, A5 = DAE (CASE is applied in all columns).
Model  A0 (baseline)  A1  A2  A3  A4  A5
microsoft_sdk_en 9.02 (1) 9.81 (1) 9.68 (1) 9.05 (1) 9.02 (1) 10.46 (1)
whisper_large_v2 9.19 (2) 22.43 (14) 9.91 (3) 9.22 (2) 10.88 (4) 10.57 (2)
google_USM_en 9.23 (3) 9.85 (2) 9.85 (2) 9.26 (3) 10.77 (3) 10.75 (3)
whisper_large_v1 9.68 (4) 22.62 (15) 10.39 (5) 9.71 (4) 11.34 (5) 11.12 (5)
k2_gigaspeech 9.71 (5) 10.34 (3) 10.36 (4) 9.74 (5) 9.72 (2) 11.06 (4)
tencent_api_en 10.37 (6) 20.88 (13) 10.94 (6) 10.39 (6) 11.42 (6) 11.70 (6)
nemo_conformer_transducer_xlarge_en 11.53 (7) 12.35 (4) 12.22 (7) 11.57 (7) 11.53 (7) 12.78 (7)
amazon_api_en 11.71 (8) 23.78 (16) 12.30 (8) 11.73 (8) 13.30 (11) 13.05 (8)
google_api_en 11.78 (9) 12.41 (5) 12.48 (9) 11.80 (9) 13.29 (10) 13.32 (10)
nemo_conformer_ctc_large_en 12.01 (10) 12.81 (6) 12.62 (10) 12.04 (10) 12.01 (8) 13.27 (9)
aliyun_api_en 12.83 (11) 13.71 (7) 13.44 (11) 12.86 (11) 12.83 (9) 13.98 (11)
vosk_model_en_large 14.08 (12) 14.84 (8) 14.68 (12) 14.10 (12) 14.08 (12) 15.47 (12)
baidu_api_en 16.80 (13) 17.56 (9) 17.43 (13) 16.83 (13) 16.81 (13) 17.98 (13)
data2vec_audio_large_ft_libri_960h 17.92 (14) 18.68 (10) 18.13 (14) 17.99 (14) 17.92 (14) 18.85 (14)
hubert_xlarge_ft_libri_960h 18.43 (15) 19.19 (11) 18.61 (15) 18.50 (15) 18.43 (15) 19.33 (15)
wav2vec2_large_robust_ft_libri_960h 18.81 (16) 19.56 (12) 18.93 (16) 18.89 (16) 18.81 (16) 19.73 (16)
deepspeech_model_en 31.21 (17) 31.70 (17) 31.73 (17) 31.24 (17) 31.21 (17) 32.04 (17)
coqui_model_en 32.45 (18) 33.12 (18) 32.72 (18) 32.49 (18) 32.45 (18) 33.27 (18)
wavlm_base_plus_ft_libri_clean_100h 33.02 (19) 33.74 (19) 33.06 (19) 33.07 (19) 33.02 (19) 33.42 (19)

* The Microsoft API provides a switch to disable Inverse Text Normalization (ITN), so its number in A4 (turning off NSW normalization in our pipeline) is basically the same as the baseline.

III-B1 PUNC

The comparison between A1 and A0 reminds us that the processing of punctuation is vital. By turning off the punctuation removal component, we see a dramatic increase in WER for Whisper, Tencent, and Amazon. This is because different ASR systems have quite distinctive policies for punctuation. It therefore requires serious and careful treatment, otherwise the whole evaluation can be completely biased and dominated by punctuation-related errors. Also note that proper processing of punctuation is non-trivial: consider single quotes mixed with apostrophes in contractions, commas and periods in numbers, hyphens in compound words, etc.

III-B2 ITJ

Comparing A2 and A0, we see that the processing of interjections can result in absolute WER changes of around 0.5%-1.0%. The difference seems mild, until we realize these numbers translate to a relative WER increase of roughly 10%. Therefore the removal of interjections is by no means a factor that can be ignored when evaluating conversational ASR tasks. The implementation is trivial, given a list of common interjections such as 'uh', 'eh', 'um'; our platform maintains such a list at https://github.com/SpeechColab/Leaderboard/blob/master/utils/interjections_en.csv.

III-B3 UK-US

Comparing A3 and A0, we can see that the unification of US/UK spelling conventions has only a minor effect.

III-B4 NSW and DAE

Comparing A4 and A5 with A0, it is apparent that NSW normalization and DAE are both crucial (the NSW component deals with the normalization of numbers, quantities, dates, money, times, etc., while the DAE module is responsible for contractions, abbreviations, and compound words). Traditional ASR systems tend to generate spoken-form recognition results, whereas commercial services and modern end-to-end systems (such as Whisper) yield the written form by default. The discrepancy between spoken and written forms is so significant that any serious benchmark should dive into these details to make sure the normalized hypotheses and references are consistent; otherwise the benchmark can become pointless. However, robust text normalization (TN) is challenging and immensely tedious due to numerous rules and long-tail exceptions. Therefore, we strongly suggest that other benchmark developers leverage and improve existing tools instead of building new TN from scratch.

Figure 3: Results of ablation experiments on the GigaSpeech.test dataset

In Figure 3, we show how these individual components stack up. As can be seen: i) The ranking can vary from the left (a barely working naive pipeline) to the right (a sophisticated pipeline). ii) The sophisticated pipeline yields a relative 20%-30% WER reduction over the naive pipeline. These observations lead to a concerning realization: technological progress can easily be shadowed by the implementation details of the scoring pipeline. For this reason, whenever absolute WER numbers and the rankings of different systems are discussed, the specific toolkit and pipeline in use should be deliberately brought into the context, to avoid misunderstanding and misinformation.

IV Modified Token Error Rate (mTER)

For decades, Token Error Rate (TER) has been the standard evaluation metric of Automatic Speech Recognition (ASR). In this section, we propose a new metric to address some fundamental issues of TER.

IV-A Problems with TER

IV-A1 TER violates the metric-space axioms

In mathematics, a metric space needs to satisfy a small set of requirements called the metric (in)equalities, consisting of the following axioms: 1) positivity: D(x,y) \geq 0; 2) identity: D(x,y) = 0 if and only if x = y; 3) symmetry: D(x,y) = D(y,x); 4) triangle inequality: D(x,y) \leq D(x,z) + D(z,y). Obviously, TER violates the symmetry axiom, because:

{\rm TER}(ref,hyp) \neq {\rm TER}(hyp,ref)

This has broad impacts in practice. For industry, tools and pipelines need to implement asymmetric interfaces and options, and so does their documentation; for education, instructors need to emphasize to students how the denominator is calculated, as well as the relativity of insertion (INS) versus deletion (DEL).

IV-A2 TER is ill-normalized

TER picks |ref| as the normalizer, which makes it numerically unbounded (it can easily exceed 100%). This is all the more confusing since it is literally called an error rate.
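A small numeric example (ours, purely for illustration) makes both issues concrete. Take ref = "a" and hyp = "a b c"; then

{\rm TER}(ref,hyp) = 2/1 = 200\%, \qquad {\rm TER}(hyp,ref) = 2/3 \approx 66.7\%

i.e. the same pair of strings yields two different scores depending on the direction, and one of them exceeds 100%.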

These problems have been identified and reported in the prior literature, and new metrics have been proposed in [34, 35]. Besides fixing the above issues, these new metrics tend to incorporate some sort of informational/cognitive measure into ASR evaluation, which is generally the right call. Unfortunately, a new metric has to fight against strong inertia, since the main purpose of a metric is to serve comparisons over time. As a result, these backward-incompatible metrics never received widespread adoption.

TABLE VIII: Comparison of TER/mTER (in %)
Model  LibriSpeech.test-clean  LibriSpeech.test-other  TEDLIUM3.dev  TEDLIUM3.test  GigaSpeech.dev  GigaSpeech.test  VoxPopuli.dev  VoxPopuli.test  VoxPopuli.test-accented  CommonVoice11.0.dev  CommonVoice11.0.test
aliyun_api_en 4.34/4.33 10.01/9.99 5.79/5.79 5.53/5.53 12.63/12.63 12.83/12.83 11.93/11.93 11.50/11.49 15.61/15.61 15.67/15.51 17.86/17.67
amazon_api_en 6.42/6.41 13.20/13.20 5.11/5.11 4.74/4.74 11.31/11.31 11.71/11.71 12.14/12.13 11.84/11.84 14.56/14.56 22.66/22.56 26.25/26.25
baidu_api_en 6.61/6.58 14.69/14.65 8.65/8.65 7.93/7.93 16.94/16.94 16.80/16.80 14.55/14.55 14.08/14.08 19.98/19.98 22.74/22.60 25.97/25.97
google_api_en 5.63/5.61 12.56/12.53 5.54/5.54 5.26/5.26 11.70/11.70 11.78/11.78 11.95/11.93 11.77/11.76 15.72/15.68 17.49/17.31 20.60/20.58
google_USM_en 2.13/2.13 4.35/4.35 3.69/3.69 3.05/3.05 8.67/8.60 9.23/9.23 9.89/9.89 9.50/9.50 14.18/14.18 8.69/8.69 10.19/10.19
microsoft_sdk_en 3.03/3.02 6.65/6.64 3.65/3.65 3.14/3.14 8.74/8.69 9.02/9.02 9.04/9.00 8.88/8.84 12.17/12.09 8.86/8.84 10.46/10.46
tencent_api_en 3.54/3.54 7.26/7.26 4.94/4.94 4.20/4.20 10.19/10.19 10.37/10.37 9.21/9.21 8.90/8.90 15.81/15.81 11.59/11.52 12.91/12.91
vosk_model_en_large 5.17/5.16 12.36/12.36 5.74/5.74 5.41/5.41 13.84/13.84 14.08/14.08 13.22/13.08 12.52/12.41 20.27/19.61 21.66/21.29 25.46/25.42
deepspeech_model_en 6.95/6.95 21.23/21.23 17.71/17.71 18.31/18.31 32.68/32.68 31.21/31.21 28.46/28.46 29.10/29.10 38.23/38.23 43.34/43.34 47.80/47.80
coqui_model_en 4.94/4.93 15.92/15.92 17.48/17.48 18.66/18.66 34.62/34.62 32.45/32.45 27.60/27.60 27.66/27.66 37.85/37.85 35.52/35.52 40.10/40.10
nemo_conformer_ctc_large_en 1.95/1.95 4.26/4.26 5.11/5.10 4.69/4.69 11.70/11.68 12.01/12.01 6.68/6.68 6.61/6.61 18.24/18.24 8.44/8.44 9.23/9.23
nemo_conformer_transducer_xlarge_en 1.36/1.36 2.76/2.76 4.61/4.61 4.31/4.31 10.85/10.85 11.53/11.53 6.05/6.05 5.97/5.97 20.80/20.80 5.20/5.20 5.82/5.82
k2_gigaspeech 2.18/2.18 5.30/5.30 3.42/3.42 3.40/3.40 8.90/8.90 9.71/9.71 9.80/9.80 9.48/9.48 12.46/12.46 14.05/14.05 17.04/17.04
whisper_large_v1 2.66/2.66 5.81/5.80 4.90/4.90 4.31/4.31 9.11/9.11 9.68/9.68 8.31/8.31 7.88/7.87 20.80/20.80 8.68/8.64 10.89/10.82
whisper_large_v2 2.14/2.14 4.65/4.65 3.70/3.70 3.55/3.55 8.53/8.53 9.19/9.19 7.82/7.82 7.17/7.17 18.83/18.83 7.88/7.86 10.01/9.96
data2vec_audio_large_ft_libri_960h 1.64/1.64 3.85/3.85 8.31/8.26 7.66/7.66 18.43/18.27 17.92/17.92 13.73/13.66 13.42/13.34 19.45/19.07 18.25/18.17 21.75/21.63
hubert_xlarge_ft_libri_960h 1.79/1.79 3.48/3.48 8.94/8.85 8.14/8.14 18.87/18.70 18.43/18.43 14.47/14.39 13.62/13.53 20.58/20.03 18.05/17.95 21.79/21.63
wav2vec2_large_robust_ft_libri_960h 2.46/2.46 5.18/5.18 9.66/9.52 9.02/8.97 19.80/19.45 18.81/18.75 14.99/14.83 14.76/14.60 21.00/20.19 18.93/18.86 22.35/22.22
wavlm_base_plus_ft_libri_clean_100h 7.00/7.00 15.68/15.68 18.75/18.64 18.21/18.21 34.78/34.78 33.02/33.02 21.55/21.48 21.16/21.07 27.94/27.24 40.17/40.17 44.96/44.96
Figure 4: An example showing the difference in values between mTER and TER.
----------------------------------------------------- an example -----------------------------------------------------
{"uid":YOU1000000117_S0000168, "TER":76.92, "mTER":43.48, "cor":13,"sub":0, "ins":10, "del":0}
  REF  : FOR OLDER KIDS THAT CAN BE THE SAME *   WE DO IT AS ADULTS *   *    *           *     *   *   *    *   *
  HYP  : FOR OLDER KIDS THAT CAN BE THE SAME WAY WE DO IT AS ADULTS FOR MORE INFORMATION VISIT WWW DOT FEMA DOT GOV
  EDIT :                                     I                      I   I    I           I     I   I   I    I   I
----------------------------------------------------------------------------------------------------------------------

IV-B mTER

In [36], Vitányi et al. proposed a universal similarity metric based on Kolmogorov complexity, called the Normalized Information Distance (NID):

{\rm NID}(x,y) \doteq \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x|\epsilon), K(y|\epsilon)\}}    (2)

Due to the incomputability of Kolmogorov complexity (a consequence of Turing's halting problem), NID can only serve as a theoretical framework. Practical approximations of NID, such as the Normalized Compression Distance (NCD) [37] and the Normalized Web Distance (NWD), have been investigated for a wide range of tasks including genome phylogeny [38], plagiarism detection [39], and information retrieval [40].

The similarity between conditional Kolmogorov complexity and Levenshtein distance is striking: K(O|I) depicts the theoretical minimal program length needed to generate output O given input I (both encoded as bit sequences) within a Universal Turing Machine, whereas \mathcal{LD}(I \to O) takes a more pragmatic approach by counting the minimal edit instructions required to transform I into O (both encoded as token sequences) within the heavily constrained instruction set {INS, DEL, SUB}. Within the scope of this paper it is not necessary to establish a more formal connection between K and \mathcal{LD} (although the algorithmic information framework can provide deeper insights: for instance, INS and DEL should by no means be equally weighted from an information perspective, because it requires immensely more complexity for da Vinci to create the Mona Lisa than for someone to erase it, i.e. K(MonaLisa|\epsilon) \gg K(\epsilon|MonaLisa), so in principle more cognition-aware metrics can be designed following algorithmic information theory). We denote it as a simple heuristic:

K(O|I) \Leftrightarrow \mathcal{LD}(I \to O)    (3)

Given this heuristic, and observing the definition of NID (paying particular attention to the form of the denominator, i.e. the metric normalizer), we propose the modified Token Error Rate:

{\rm mTER}(ref,hyp) \doteq \frac{\max\{\mathcal{LD}(hyp \to ref), \mathcal{LD}(ref \to hyp)\}}{\max\{\mathcal{LD}(\epsilon \to ref), \mathcal{LD}(\epsilon \to hyp)\}} = \frac{\mathcal{LD}(ref \to hyp)}{\max\{|ref|, |hyp|\}}    (4)

Note that the simplification of the numerator above relies on the fact that \mathcal{LD} is symmetric with regard to ref and hyp.

Compared to the vanilla TER in Equation (1), this is merely a modification to the denominator: |ref| \Rightarrow \max\{|ref|, |hyp|\}. However, the resulting mTER is symmetric, gracefully normalized and bounded within [0%, 100%], and perfectly conforms to the aforementioned metric-space axioms.
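Continuing the earlier plain-Python sketches (illustrative only, not the platform's WFST scorer), mTER is a one-line change on top of the same Levenshtein routine, and a synthetic hallucination case shows how it stays bounded while TER overflows:

    def mter(ref: list[str], hyp: list[str]) -> float:
        """Modified TER (Eq. 4): same numerator as TER, normalized by max(|ref|, |hyp|)."""
        return 100.0 * levenshtein(ref, hyp) / max(len(ref), len(hyp))

    # Synthetic hallucination case (in the spirit of Figure 4): hyp appends 3 spurious words.
    ref = "thanks for listening".split()
    hyp = "thanks for listening visit our website".split()
    print(ter(ref, hyp))   # 3 INS / 3 ref words        -> 100.0
    print(mter(ref, hyp))  # 3 INS / max(3, 6) = 3 / 6  ->  50.0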

IV-C Empirical study on mTER and TER

IV-C1 Dataset-Level comparison

Table VIII compares TER and mTER for all models and test sets on the platform. About 66% of the mTER numbers are exactly the same as the corresponding TER, and for the rest the discrepancies are minor (below 4% relative). This basically means mTER is backward-compatible: researchers will still be able to compare against TER numbers from prior works.

IV-C2 Utterance-Level comparison

Figure 5: TER vs mTER scatter plots

Figure 5 shows mTER vs TER for all test utterances with the whisper_large_v1 model. The samples on the red diagonal line represent utterances whose mTER is exactly equal to TER, accounting for 87% of the total. The remaining 13% of samples illustrate how mTER and TER are correlated, and how mTER wraps overflowed TER (i.e. TER > 100%) back into [0%, 100%].

Figure 4 provides an utterance from our benchmark on Whisper, giving a closer look at the discrepancy between TER and mTER. The recognition result contains 23 words in total, among which 9 words are hallucinated. In this case, mTER (43.48%) is intuitively better than TER (76.92%) because it reflects the valid portion of the ASR output more precisely. TER is vulnerable to overflow, especially on short utterances and with insertion errors such as model hallucination.

To sum up, through large-scale empirical experiments we find the proposed mTER to be robust: i) since mTER is symmetric, it is no longer necessary for scoring tools, pipelines, and documentation to differentiate ref and hyp; ii) evaluation results are guaranteed to be normalized, with no more confusing overflowed error rates. In practice, the adoption of mTER is demonstrated to be backward-compatible with the existing TER in dataset-level evaluations.

V Conclusion

This paper introduces the SpeechColab Leaderboard, an open-source evaluation platform for Automatic Speech Recognition (ASR). Based on the proposed platform, we conduct and report an extensive benchmark, revealing the landscape of state-of-the-art ASR systems from both research and industry. In addition, our study quantitatively investigates the relevance of different components within the evaluation pipeline, providing a valuable reference for the community to build serious ASR benchmarks in the future. Furthermore, leveraging our platform and the benchmark results, we propose a new metric, the modified Token Error Rate (mTER), which is more robust and elegant than the traditional Token Error Rate (TER). In the future, we intend to incorporate more datasets and models into the platform.

Acknowledgments

We would like to thank Yong LIU for his help in integrating the TEDLIUM-3 evaluation sets.

References

  • [1] L. Deng, P. Kenny, M. Lennig, V. Gupta, F. Seitz, and P. Mermelstein, “Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition,” IEEE Transactions on Signal Processing, vol. 39, no. 7, pp. 1677–1681, 1991.
  • [2] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, pp. 30–42, 2011.
  • [3] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI.” in Interspeech, 2016, pp. 2751–2755.
  • [4] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
  • [5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [6] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” Interspeech 2020, 2020.
  • [7] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proc. ICML, 2006.
  • [8] A. Graves, “Sequence transduction with recurrent neural networks,” Computer Science, vol. 58, no. 3, pp. 235–242, 2012.
  • [9] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning.   PMLR, 2023, pp. 28 492–28 518.
  • [10] Y. Zhang, W. Han, J. Qin, Y. Wang, A. Bapna, Z. Chen, N. Chen, B. Li, V. Axelrod, G. Wang et al., “Google USM: Scaling automatic speech recognition beyond 100 languages,” arXiv preprint arXiv:2303.01037, 2023.
  • [11] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020.
  • [12] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [13] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [14] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: A general framework for self-supervised learning in speech, vision and language,” in International Conference on Machine Learning.   PMLR, 2022, pp. 1298–1312.
  • [15] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey et al., “The HTK book,” Cambridge university engineering department, vol. 3, no. 175, p. 12, 2002.
  • [16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.   IEEE Signal Processing Society, 2011.
  • [17] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proceedings of Interspeech, 2018, pp. 2207–2211.
  • [18] O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook et al., “Nemo: A toolkit for building ai applications using neural modules,” arXiv preprint arXiv:1909.09577, 2019.
  • [19] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong et al., “SpeechBrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.
  • [20] Z. Yao, D. Wu, X. Wang, B. Zhang, F. Yu, C. Yang, Z. Peng, X. Chen, L. Xie, and X. Lei, “WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,” in Proc. Interspeech.   Brno, Czech Republic: IEEE, 2021.
  • [21] M. Del Rio, N. Delworth, R. Westerman, M. Huang, N. Bhandari, J. Palakapilly, Q. McNamara, J. Dong, P. Zelasko, and M. Jetté, “Earnings-21: A practical benchmark for ASR in the wild,” arXiv preprint arXiv:2104.11348, 2021.
  • [22] S. Gandhi, P. Von Platen, and A. M. Rush, “ESB: A benchmark for multi-domain end-to-end speech recognition,” arXiv preprint arXiv:2210.13352, 2022.
  • [23] A. Faria, A. Janin, K. Riedhammer, and S. Adkoli, “Toward zero oracle word error rate on the switchboard benchmark,” arXiv preprint arXiv:2206.06192, 2022.
  • [24] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2015, pp. 5206–5210.
  • [25] F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Esteve, “TED-LIUM 3: Twice as much data and corpus repartition for experiments on speaker adaptation,” in Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20.   Springer, 2018, pp. 198–208.
  • [26] G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” in 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021.   International Speech Communication Association, 2021, pp. 4376–4380.
  • [27] C. Wang, M. Rivière, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in ACL 2021-59th Annual Meeting of the Association for Computational Linguistics, 2021.
  • [28] R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 4218–4222.
  • [29] F. Kuang, L. Guo, W. Kang, L. Lin, M. Luo, Z. Yao, and D. Povey, “Pruned RNN-T for fast, memory-efficient ASR training,” arXiv preprint arXiv:2206.13236, 2022.
  • [30] E. Bakhturina, Y. Zhang, and B. Ginsburg, “Shallow fusion of weighted finite-state transducer and language model for text normalization,” arXiv preprint arXiv:2203.15917, 2022.
  • [31] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: A general and efficient weighted finite-state transducer library: (extended abstract of an invited talk),” in Implementation and Application of Automata: 12th International Conference, CIAA 2007, Prague, Czech Republic, July 16-18, 2007, Revised Selected Papers 12.   Springer, 2007, pp. 11–23.
  • [32] K. Gorman, “Pynini: A python library for weighted finite-state grammar compilation,” in Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, 2016, pp. 75–80.
  • [33] K. Gorman and R. Sproat, “Finite-state text processing,” Synthesis Lectures on Human Language Technologies, vol. 14, no. 2, pp. 1–158, 2021.
  • [34] V. Maier, “Evaluating RIL as basis of automatic speech recognition devices and the consequences of using probabilistic string edit distance as input,” Univ. of Sheffield, third year project, 2002.
  • [35] A. C. Morris, V. Maier, and P. Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition,” in Eighth International Conference on Spoken Language Processing, 2004.
  • [36] P. M. B. Vitányi, F. J. Balbach, R. L. Cilibrasi, and M. Li, Normalized Information Distance.   Boston, MA: Springer US, 2009, pp. 45–82. [Online]. Available: https://doi.org/10.1007/978-0-387-84816-7_3
  • [37] R. Cilibrasi and P. M. Vitányi, “Clustering by compression,” IEEE Transactions on Information theory, vol. 51, no. 4, pp. 1523–1545, 2005.
  • [38] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, “An information-based sequence distance and its application to whole mitochondrial genome phylogeny,” Bioinformatics, vol. 17, no. 2, pp. 149–154, 2001.
  • [39] X. Chen, B. Francia, M. Li, B. Mckinnon, and A. Seker, “Shared information and program plagiarism detection,” IEEE Transactions on Information Theory, vol. 50, no. 7, pp. 1545–1551, 2004.
  • [40] R. L. Cilibrasi and P. M. B. Vitanyi, “Normalized web distance and word similarity,” Computer Science, pp. 293–314, 2009.