EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge
Abstract
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. Building on EmergentTTS, we introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g., URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic, and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosody, intonation, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences. We open-source the evaluation code (https://github.com/boson-ai/EmergentTTS-Eval-public) and the dataset (https://huggingface.co/datasets/bosonai/EmergentTTS-Eval).
1 Introduction
Recent breakthroughs in generative modeling have led to significant advancements in Text-to-Speech (TTS) technology seed-tts ; xu2025qwen25omnitechnicalreport ; lajszczak2024base ; sparktts ; cosyvoice . State-of-the-art proprietary systems now demonstrate remarkable naturalness and human-like quality when converting standard, well-formed text into spoken language. These systems are widely deployed in various applications, including virtual assistants voicequalitytech , audiobooks walsh2023largescaleautomaticaudiobookcreation ; pethe2025prosodyanalysisaudiobooks , navigation systems inbookcartts ; speechrecogincar , and accessibility tools Raffoul2023 ; tts4disability . However, as TTS technology becomes more integrated into real-world use cases, systems increasingly encounter complex and diverse text prompts that go beyond conventional reading tasks, such as code switching, or rendering complex technical character sequences.
Meanwhile, evaluation methodologies for TTS systems have not kept pace with the growing complexity of use cases. Current benchmarks exhibit several limitations: they often use restricted text domains xia2024iscslp2024conversationalvoice , lack diversity in linguistic phenomena wang2024evaluatingtexttospeechsynthesislarge , and rely on costly, non-reproducible human evaluations that may vary significantly across different listener cohorts. Even worse, code-switching across multiple languages requires highly polyglot evaluators (or many specialized ones). Thus, for reasons of practicality, many evaluations focus on voice cloning alone.
Real-world TTS applications encounter numerous challenges that remain difficult for current systems. These include accurately reflecting human emotions and sounds Barakat2024 . For example, when narrating various types of books, such as fantasy or children's literature, TTS systems must realistically handle quoted dialogues and paralinguistic cues to keep listeners engaged. Another dimension involves more formal scenarios, such as syntactically complex text with nested clauses in legal and literary contexts, or scientific and academic texts containing special characters and equations that are difficult to pronounce. Additionally, there is a growing need for TTS systems to handle multilingual content lou2025generalizedmultilingual ; Cho_2022 ; Sellam_2023 and to properly intonate questions with contextually appropriate prosody intonation4tts , challenges that current evaluation frameworks fail to systematically address. An evaluation methodology is required that reliably captures TTS performance across all these scenarios, moving away from subjective human assessments of expressiveness and prosody.
To this end, we propose EmergentTTS-Eval, a comprehensive benchmark specifically designed to evaluate TTS performance across these challenging scenarios. Our benchmark covers six critical dimensions, and through an iterative refinement process we controllably generate increasingly difficult utterances for TTS systems to synthesize. Furthermore, drawing a parallel to the textual domain, where reward LLMs are widely used for judging the output of other LLMs, we propose the model-as-a-judge paradigm for evaluating TTS systems. Our contributions are as follows:
We create a benchmark with 1,645 samples for evaluating TTS systems across six challenging scenarios: Emotions, Paralinguistics, Syntactic Complexity, Questions, Foreign Words and Complex Pronunciation.
We propose an iterative refinement strategy with LLMs that creates increasingly complex utterances for TTS, resulting in a multi-layered and diverse evaluation benchmark for evaluating all aspects of TTS performance.
We are the first to use Large Audio Language Models (LALMs) as reward models for judging otherwise subjective dimensions of audio, such as expressiveness, prosody, pausing, stress, and pronunciation accuracy. We show the approach's effectiveness through correlation with human judgments, and the results are stable under the choice of judger LALM.
We evaluate leading open-source and closed-source TTS systems on our benchmark, showing how model-as-a-judge reveals finer-grained and systematic failures, and highlights the gap between closed-source and open-source models on specific aspects of speech generation.
2 Related Work
2.1 Text-to-Speech Model Evaluation Metrics
Traditional TTS evaluation relies on humans to provide a Mean Opinion Score (MOS), which is both costly and statistically noisy due to its reliance on a changing pool of evaluators. Recent advances in automatic TTS evaluation typically rely on two metrics: Word Error Rate (WER) seed-tts ; xu2025qwen25omnitechnicalreport ; lajszczak2024base ; cosyvoice , computed by using an ASR model to convert the generated speech back into text and comparing it with the reference text, and a speaker-similarity score (SIM) seed-tts ; xu2025qwen25omnitechnicalreport ; lajszczak2024base ; sparktts , computed by comparing the latent embeddings of generated and reference speech using an audio foundation model such as WavLM wavelm . Recent work has also explored the use of models to directly predict MOS (Sim-MOS) by training on datasets such as the Samsung Open MOS Dataset Maniati_2022 and the VoiceMOS Challenge voicemoschallenge .
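For concreteness, the SIM metric can be sketched as the cosine similarity between speaker embeddings of the generated and reference audio. The snippet below is a minimal illustration, not the exact pipeline of the cited systems; `embed_fn` stands in for whatever speaker encoder (e.g., a WavLM-based verification model) is used.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_sim(embed_fn, generated_wav: str, reference_wav: str) -> float:
    """SIM score: embed both utterances with a speaker encoder and compare.
    `embed_fn` is a placeholder mapping an audio path to a 1-D embedding."""
    e_gen = embed_fn(generated_wav)
    e_ref = embed_fn(reference_wav)
    return cosine_similarity(e_gen, e_ref)
```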
While these metrics capture how natural or accurate a system sounds, their evaluation power is limited by the difficulty and expressivity of the underlying voice dataset, and they cannot handle nuanced, context-sensitive phenomena such as emotional prosody or complex syntax. More recently, BASE-TTS lajszczak2024base introduced an emergent-abilities test suite that probes seven linguistically motivated phenomena, such as compound nouns, emotions, foreign words, and paralinguistics, using 20 hand-crafted prompts per category. Although BASE-TTS shifts the focus toward higher-order TTS capabilities, its dataset size is limited and its reliance on human expert judges is costly.
Our work addresses these limitations by automating test-generation and expanding category coverage. In particular, we create progressively harder stimuli at scale to differentiate between high-performing TTS systems. Our framework thus bridges the gap between traditional metric-based evaluation and nuanced, reproducible benchmarking.
2.2 Model-as-a-Judge For Text-to-Speech Model Evaluation
A key weakness in previous benchmarks is the need for human judges. Recent years have seen a growing trend of integrating audio encoders with LLMs. This has resulted in large audio-language models (LALMs) that excel at a variety of audio comprehension tasks wavelm ; geminiteam2025geminifamilyhighlycapable ; rubenstein2023audiopalmlargelanguagemodel ; tang2024salmonngenerichearingabilities ; chu2023qwenaudioadvancinguniversalaudio . SALMONN tang2024salmonngenerichearingabilities uses a finetuned LALM to predict MOS, SIM, and A/B testing scores. Wang et al. wang2025enablingauditorylargelanguage extend this by finetuning an LALM to also generate open-ended qualitative feedback, covering noisiness, distortion, prosody, etc., alongside scores. This approach leverages the LLM's contextual knowledge to provide multi-dimensional evaluations more akin to a human expert. Chen et al. chen2025audio compiled the first corpus of human-written TTS evaluations (with overall MOS plus detailed error annotations) and used it to train an audio-augmented GPT model. The resulting system can describe speech-quality degradations and compare two samples in free-form language, and it outperforms prior state-of-the-art MOS prediction models. WavReward ji2025wavrewardspokendialoguemodels employs a generalist reward model to score spoken dialogue quality across dimensions like clarity and expressiveness.
Our work not only uses LALMs as judges but also leverages them to generate tests spanning categories of emergent TTS abilities. Our evaluation demonstrates that even out-of-the-box LALMs like Gemini 2.5 Pro are capable of evaluating emergent capabilities in SOTA TTS systems and produce A/B testing results that are highly correlated with human preference.
3 EmergentTTS-Eval Benchmark
In this section, we describe how we construct the datasets in EmergentTTS-Eval, which cover six categories of challenging real-world TTS scenarios with varying levels of complexity. We also describe how the evaluation process is scaled with the help of a Large Audio Language Model (LALM).


3.1 Dataset Construction
We follow two key guidelines when constructing text prompts in EmergentTTS-Eval: (1) the dataset should encompass real-world challenges faced by TTS systems, and (2) it should exhibit varying levels of difficulty to enable fine-grained assessment of system capabilities. To this end, we begin with a diverse set of seed prompts and iteratively expand their scope (breadth) and complexity (depth). This reflects the approach of progressively increasing instruction difficulty used in instruction tuning xu2023wizardlm . Our seed prompts are derived from a collection of 140 human-curated samples introduced in BASE-TTS (lajszczak2024base, ). These samples span 7 challenging TTS categories commonly encountered in real-world scenarios: “Compound Nouns”, “Emotions”, “Foreign Words”, “Paralinguistics”, “Questions”, “Syntactic Complexities”, and “Punctuation”, with 20 samples per category. Some prompts pose challenges for TTS systems because generating realistic speech requires a deep understanding of the text’s semantic context. Consider the following text prompt from the “Emotions” category: A profound sense of realization washed over Beal as he whispered, "You’ve been there for me all along, haven’t you? I never truly appreciated you until now." An effective TTS system should recognize the emotional context and appropriately render the quoted sentence as a whisper to reflect Beal’s sentiment.
Although the set proposed in BASE-TTS is of high quality, it is limited in its ability to explore the depth within each category and lacks broad diversity, as it was curated by a small group of individuals. However, complexity and diversity are essential for a robust evaluation benchmark, as they help assess challenging scenarios where system failures can significantly impact user experience. For example, we want to evaluate real-world scenarios such as sequential interrogative questions, sustained emotion synthesis with natural shifts to contrasting emotions, multi-code switching, etc. In addition to the categories defined in BASE-TTS, we introduce a new category called “Complex Pronunciation”, which contains prompts featuring unusual characters, numerals, and tongue-twisters. We exclude the “Compound Nouns” category due to its limited scope and the strong performance of current TTS systems according to manual assessment. We also drop the “Punctuation” category, as punctuation-related challenges are inherently addressed within other categories such as “Paralinguistics”, “Syntactic Complexity”, and “Complex Pronunciation”.
To enrich the complexity of the text prompts and improve their diversity, we leverage LLMs to iteratively refine the initial utterances. The LLM is first tasked to curate samples that improve the dataset breadth-wise, guided by explicit diversity-enhancement criteria embedded in the prompt and reinforced by strict structural constraints. Afterwards, we apply an iterative refinement process to construct a multi-tiered dataset encompassing utterances of varying complexity: we take a base utterance and create a deeper version of it through a category-specific refinement method; the refined utterance can then be fed as input to the next refinement step to obtain an even more challenging version, and so on (a minimal sketch of this loop follows the category descriptions below). According to our experiments, LLMs are able to generate strong refinements if we provide detailed criteria in the instruction, and three refinement steps are sufficient. We share the prompts we use for all the categories in Appendix A, and example refinements for the Paralinguistics and Foreign Words categories are shown in Figures 2 and 2, respectively. Here are the descriptions of the six categories in the final dataset:
Questions: Contains sequential questions and statements. This evaluates the TTS system’s ability to generate interrogative and declarative prosody.
Emotions: Contains narrative text with long quoted dialogues of emotion intensification, followed by contrasting emotions.
Paralinguistics: Contains vocal interjections (Uhh, Hmmm), onomatopoeia (Achoo, tick-tock, etc.), varied emphasis markers (capitalization, vowel elongation, syllable emphasis with hyphens), pacing cues such as ellipses (…) or punctuation (STOP.RIGHT.THERE), and stuttering (I-I-I d-didn’t, so so-so-so-sorry).
Foreign Words: Covers 15 unique languages with idiomatically and prosodically rich phrases placed in between English text.
Syntactic Complexity: Covers complex text with garden-path sentences, deeply nested clauses with centre embeddings, homographs, and other forms of syntactic complexity.
Complex Pronunciation: Texts with emails, phone numbers, URLs, street addresses, location references, STEM equations, units and notations, abbreviations (both initialisms, pronounced letter by letter, and acronyms, pronounced as words), and tongue twisters.
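As referenced above, the following is a minimal sketch of the breadth-then-depth generation loop, assuming a generic `llm_complete` helper for the LLM call (e.g., to Gemini 2.5 Pro) and the prompt templates described in Appendix A; it is illustrative rather than the released generation script.

```python
from typing import Callable, List

def build_depth_tiers(
    seed_utterances: List[str],          # depth-0 samples for one category
    refine_prompt_template: str,         # category-specific refinement prompt (Appendix A)
    llm_complete: Callable[[str], str],  # placeholder for an LLM completion call
    num_steps: int = 3,                  # three refinement steps are used in the benchmark
) -> List[List[str]]:
    """Return tiers [depth-0, depth-1, depth-2, depth-3] for one category.
    Each deeper tier is produced by feeding the previous tier's utterances
    back into the category-specific refinement prompt."""
    tiers = [list(seed_utterances)]
    for _ in range(num_steps):
        refined = [
            llm_complete(refine_prompt_template.format(text_to_synthesize=text))
            for text in tiers[-1]
        ]
        tiers.append(refined)
    return tiers
```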
Dataset Statistics:
For each of the five categories that we retain from BASE-TTS, we curate a total of 70 seed utterances by combining 50 breadth-wise expanded sentences with the 20 human-curated ones; we then perform three iterative refinement steps, yielding 210 additional samples and thus 280 samples for each of these five categories. For “Complex Pronunciation”, we curate 60 breadth-wise diverse samples from scratch, which become 240 samples after three rounds of refinement. Subsequently, we add five complex short tongue twisters, each repeated multiple times; based on our manual observation, TTS systems often struggle with repeated articulation, where a single slip can lead to a cascading effect. We report these findings in Section 4.2. The total sample count thus comes out to 5 × 280 + 240 + 5 = 1,645. Category-wise statistics are shown in Appendix A.7.
3.2 Large-Audio-Language Model as Judge
Synthesizing speech for all benchmark utterances results in approximately 420 minutes (roughly 7 hours) of audio per TTS system. Evaluating this volume of audio with human raters is time-consuming and resource-intensive, offers limited reproducibility, and requires specialized linguistic knowledge. To address these limitations, we employ Large Audio Language Models (LALMs) as automatic judges. Our benchmark specifically targets aspects of speech synthesis, such as prosody, pausing, expressiveness, and pronunciation, that are not adequately captured by traditional metrics like Word Error Rate (WER) or MOS-based quality estimators. Accurately assessing these dimensions requires a general-purpose, high-capacity audio understanding model.
We choose Gemini 2.5 Pro as our primary LALM-based judge due to its strong performance on established audio reasoning benchmarks such as MMAU (mmaumassiv, ) (See Appendix B for performance comparison of different LALMs on MMAU). Notably, it leverages inference-time scaling deepseekai2025deepseekr1incentivizingreasoningcapability ; liu2025inferencetimescalinggeneralistreward before producing outputs, which aligns well with the complexity of our evaluation tasks.
To evaluate a candidate TTS system S, we compare it against a strong reference system (the baseline), chosen to have low WER to ensure high-fidelity synthesis of the benchmark utterances. For each evaluation instance, both systems generate speech for the same input, and the two outputs are randomly assigned to the first and second audio positions to avoid positional bias. The LALM judge is provided with the original text, the associated category label, and a structured evaluation prompt that includes the target evaluation dimension (e.g., prosody, emotion), the scoring rubric, and detailed category-specific reasoning guidelines. The model is then presented with the first audio, followed by a separation marker, and then the second audio.
The LALM returns a structured JSON response containing natural-language justifications for the performance of each system, a comparative analysis highlighting key differences (annotated as either subtle or significant), a scalar score for each system, and a final winner label indicating a tie, a win for the first audio, or a win for the second audio. The prompt is designed to elicit chain-of-thought reasoning with timestamp-based analysis, and encourages the model to resolve borderline cases by articulating fine-grained distinctions and predicting human preferences. The full judger prompts used for each category are shared in Appendix C.3.
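For illustration, parsing the judge's structured response might look like the sketch below; the field names are placeholders, and the released judger prompts in Appendix C.3 define the actual schema.

```python
import json

def parse_judge_response(raw: str) -> dict:
    """Parse the LALM judge's JSON output into a normalized verdict.
    Field names are illustrative placeholders, not the released schema."""
    data = json.loads(raw)
    return {
        "analysis_first": data["first_audio_analysis"],   # per-system justification
        "analysis_second": data["second_audio_analysis"],
        "comparison": data["comparative_analysis"],        # subtle vs. significant differences
        "score_first": float(data["first_audio_score"]),
        "score_second": float(data["second_audio_score"]),
        "winner": data["winner"],                          # "tie", "first", or "second"
    }
```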
We adopt a win-rate-based metric to summarize performance. Let WR(S) denote the win-rate of system S relative to the baseline. Counting a tie as half a win, it is computed as
WR(S) = (1/N) · Σ_{i=1}^{N} [ 1{w_i = l_i(S)} + 0.5 · 1{w_i = tie} ],
where w_i is the winner label returned by the judge for the i-th comparison, l_i(S) corresponds to the randomized label assigned to S in that comparison, and N is the total number of comparisons. A score of 50% reflects parity with the baseline, while deviations indicate relative superiority or inferiority.
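A minimal sketch of this computation, assuming ties are credited as half a win and using illustrative label strings, is shown below.

```python
from typing import List

def win_rate(winners: List[str], labels_for_s: List[str]) -> float:
    """winners[i]     : judge verdict for comparison i ("tie", "first", or "second").
    labels_for_s[i]   : the randomized position ("first" or "second") assigned to
    the candidate system S in comparison i. Ties count as half a win."""
    score = 0.0
    for verdict, label in zip(winners, labels_for_s):
        if verdict == "tie":
            score += 0.5
        elif verdict == label:
            score += 1.0
    return score / len(winners)
```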
This evaluation protocol enables robust, interpretable, and reproducible TTS comparison at scale. Unlike human raters, the LALM offers consistent judgments across multilingual and prosodically rich utterances, and its outputs include timestamp-grounded rationales that support fine-grained diagnostic analysis as evidenced by examples provided in the Appendix D. Our experiments in Section 4.6.2 show that the judge-based win-rate has high correlation with human preference as well.
4 Experiments
4.1 Setup
Models Evaluated
We evaluate seven open-source models: Suno Bark (TTS) (suno-bark, ), Sesame-1B (TTS) (sesame1B, ), Zyphra/Zonos (TTS) (zyphra-zonos-v01, ), Tortoise-TTS (TTS) (tortoise-tts, ), MiniCPM (LALM) (minicpm-o, ), Qwen2.5 Omni (LALM) (xu2025qwen25omnitechnicalreport, ), and Orpheus-TTS (TTS) (orpheustts, ). In addition, we benchmark four closed-source systems using their flagship models: ElevenLabs’ multilingual-v2 (TTS), Deepgram’s Aura-2 (TTS), HumeAI’s Octave (TTS), and OpenAI’s GPT-4o suite, which includes both TTS and audio-reasoning variants: gpt-4o-mini-tts (TTS), gpt-4o-audio-preview-2024-12-17 (LALM), and gpt-4o-mini-audio-preview-2024-12-17 (LALM). For models that are fine-tuned on specific voices, we pre-select some of these voices for the main results. As we show later in Section 4.4, the final win-rate can be sensitive to the choice of voice. In addition to the win-rate described in Section 3.2, we follow standard practice by computing WER using Whisper-v3-large (radford2022whisper, ) and MOS scores using a fine-tuned wav2vec2.0 model (Andreev_2023, ).
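As a sketch of the automatic metrics under this setup, WER can be computed by transcribing the synthesized audio with Whisper and scoring it against the input text with a standard WER implementation; the snippet below assumes the openai-whisper and jiwer packages and omits the text-normalization details of the actual pipeline.

```python
import whisper  # openai-whisper
import jiwer

asr = whisper.load_model("large-v3")  # ASR model used for WER in this setup

def compute_wer(reference_text: str, audio_path: str) -> float:
    """Transcribe the synthesized audio and compare it to the input text.
    A production pipeline would normalize casing/punctuation on both sides first."""
    hypothesis = asr.transcribe(audio_path)["text"]
    return jiwer.wer(reference_text, hypothesis)
```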
Prompting
For pure TTS models such as Sesame1B, Orpheus-TTS, Aura-2, and Eleven Multilingual v2, we directly provide the utterance text. For other models, we compare a basic prompting setup (utterance only) with a Strong Prompting strategy, where the input is augmented with category-specific instructions (e.g., “be emotionally expressive” for the Emotions category). For HumeAI and GPT-4o-mini-tts, these instructions are passed via optional style descriptors; for LALMs like Qwen 2.5 Omni and GPT-4o audio variants, they are included in the user message.
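A hedged illustration of how a Strong Prompting request might be assembled is shown below; the instruction strings and field names are examples in the spirit of the released templates (Appendix C), not the exact ones.

```python
# Illustrative category-specific instructions; see Appendix C for the real templates.
CATEGORY_INSTRUCTIONS = {
    "Emotions": "Be emotionally expressive and follow the emotional cues in the text.",
    "Paralinguistics": "Audibly render interjections, onomatopoeia, emphasis, and stutters.",
    "Foreign Words": "Pronounce the non-English phrases with native-like pronunciation.",
}

def build_tts_request(utterance: str, category: str, model_kind: str) -> dict:
    """For instruction-following TTS models the instruction goes into an optional
    style-descriptor field; for LALMs it is prepended to the user message."""
    instruction = CATEGORY_INSTRUCTIONS.get(category, "")
    if model_kind == "tts_with_style":
        return {"input": utterance, "instructions": instruction}
    return {"messages": [{"role": "user", "content": f"{instruction}\n\n{utterance}"}]}
```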
We calculate the win-rate of all evaluated models against gpt-4o-mini-tts (Alloy voice). The judger temperature and other hyper-parameters for the judger and the evaluated models, along with the full prompting templates used to generate the audio, are provided in Appendix C.
4.2 Benchmark Performance
Overall Results:
Model performance, summarized in Table 1, reveals a broad spectrum of win-rates ranging from 8.90% to 65.17%. GPT-4o-audio-preview (Ballad voice) achieves the highest overall performance, with particularly strong results in the expressiveness-focused categories: 88.84% in Emotions and 82.14% in Paralinguistics. Notably, only GPT-4o-mini-tts with strong prompting surpasses a 50% win-rate in the Complex Pronunciation category, suggesting targeted optimization by OpenAI for this capability. HumeAI ranks as the second-best closed-source system, outperforming Deepgram’s Aura-2 (Thalia) and ElevenLabs’ Multilingual v2 (Brian). The low performance of Aura-2 in multilingual settings aligns with its lack of explicit multilingual support; when the Foreign Words category is excluded, its win-rate rises to approximately 35%, slightly above ElevenLabs. Among open-source models, Orpheus-TTS performs best, with Qwen 2.5 Omni following closely. In contrast, Bark and Sesame1B exhibit significant performance deficits, particularly in the Emotions category, and all open-source models perform very poorly on Complex Pronunciation. We observe that strong prompting consistently enhances performance for all models where both prompted and unprompted evaluations are available: for example, GPT-4o-mini-tts reaches a 56% win-rate under strong prompting, a clear improvement over its baseline configuration, and a similar gain is observed for GPT-4o-audio-preview. Win-rates and MOS scores measure different aspects of speech quality; for instance, while Deepgram achieves the highest MOS score, several models with lower MOS scores have higher win-rates, and Bark outperforms some open-source models on MOS but significantly underperforms on win-rate. Judger parsing failures stemmed from two issues: incorrect JSON formatting and reaching maximum token limits when judges became trapped in repetitive reasoning loops.
Depth-wise Performance Trends:
Figure 3 illustrates how model win-rates change across increasing refinement depths for each category. Models naturally cluster into a high-performing group (average win-rate around 50% or above) and a low-performing group. Although we might expect deeper utterances to widen this performance gap, with strong models excelling and weaker ones faltering, our findings reveal more nuanced patterns. At higher complexity levels, both models may encounter difficulties, increasing the likelihood of ties. Additionally, strong models sometimes reveal systematic weaknesses when challenged by greater complexity, while lower-performing models occasionally match or exceed the baseline by avoiding specific failure modes. Nevertheless, four of our six categories exhibit clear depth-sensitive performance trends. The exceptions are Questions and Syntactic Complexity, where more subtle prosodic expectations result in less pronounced differentiation across depths.
Systematic Failures and Judger Insights:
Depth-wise analysis reveals consistent failure patterns and demonstrates our judger’s sensitivity to prosodic, phonetic, and semantic mismatches. Most open-source models handle Questions and Syntactic Complexity adequately, with Sesame1B being the notable exception due to flat intonation and poor pausing. Sesame1B particularly struggles with Emotions, often inserting random interjections or producing monotonous speech. All open-source models underperform on Complex Pronunciation, misreading decimals, dropping digits, and breaking down at higher complexity, with MiniCPM and Tortoise-TTS failing completely even at depth 0. For Foreign Words, Sesame substitutes non-English tokens with unrelated content, while Orpheus anglicizes pronunciation to the extent of being phonetically incorrect.
Commercial models show different limitations: ElevenLabs falters on Complex Pronunciation, while Deepgram Aura-2 degrades with longer utterances and struggles with expressive Paralinguistics. OpenAI models excel in emotional and multilingual content but still exhibit subtle issues (occasional mispronunciations, dropped dates, and synthesis breakdowns) that our judger successfully identifies. The judger effectively distinguishes emphatic renditions, recognizes homograph disambiguation, and rewards appropriate prosody, though subtle paralinguistic nuances and emotional shifts remain challenging to evaluate perfectly. We provide detailed analysis of these failure modes and judger behavior in Appendix D.
Model | Voice | Emotions | Foreign Words | Paralinguistics | Complex Pronunciation | Questions | Syntactic Complexity | Overall | |||||||||
WER | Win-Rate | WER | Win-Rate | WER | Win-Rate | WER | Win-Rate | WER | Win-Rate | WER | Win-Rate | WER | Win-Rate | Parsing Fail | MOS | ||
gpt-4o-mini-tts (baseline) | Alloy | 0.72 | - | 13.45 | - | 20.55 | - | 29.90 | - | 0.42 | - | 1.04 | - | 10.61 | - | - | 4.23 |
Suno Bark (suno-bark, ) | v2/en_speaker_6 | 4.31 | 0.00% | 26.11 | 10.89% | 33.26 | 6.60% | 55.88 | 8.36% | 3.01 | 15.00% | 6.07 | 12.50% | 20.71 | 8.90% | 0 | 3.61 |
Sesame1B (sesame1B, ) | - | 17.07 | 7.32% | 45.27 | 10.35% | 49.63 | 18.92% | 80.97 | 7.40% | 2.74 | 31.78% | 4.30 | 18.88% | 32.32 | 15.96% | 4 | 3.38 |
Zyphra/Zonos (zyphra-zonos-v01, ) | exampleaudio | 7.32 | 9.67% | 28.52 | 11.96% | 25.33 | 13.75% | 45.00 | 7.95% | 7.66 | 26.78% | 4.13 | 28.13% | 19.12 | 16.55% | 2 | 3.39 |
Tortoise-TTS (tortoise-tts, ) | random | 13.04 | 17.92% | 29.61 | 10.00% | 64.93 | 14.28% | 51.87 | 1.59% | 10.44 | 28.28% | 6.35 | 30.82% | 28.62 | 17.67% | 1 | 3.03 |
MiniCPM (minicpm-o, ) | - | 12.36 | 31.83% | 33.46 | 6.42% | 58.48 | 21.50% | 82.15 | 1.84% | 5.21 | 32.50% | 3.08 | 37.50% | 31.40 | 22.36% | 4 | 3.54 |
Qwen 2.5 Omni (xu2025qwen25omnitechnicalreport, ) | Chelsie | 1.22 | 41.18% | 26.98 | 11.07% | 57.48 | 17.44% | 64.07 | 3.30% | 12.77 | 49.28% | 1.66 | 36.96% | 26.58 | 27.07% | 7 | 4.09 |
Qwen 2.5 Omni (SP) (xu2025qwen25omnitechnicalreport, ) | Chelsie | 2.41 | 41.60% | 26.77 | 11.42% | 58.44 | 20.25% | 49.51 | 6.12% | 0.87 | 51.78% | 3.47 | 38.57% | 23.03 | 28.77% | 1 | 4.09
Orpheus TTS (orpheustts, ) | Tara | 1.81 | 31.78% | 22.31 | 17.5% | 40.94 | 39.82% | 41.04 | 10.61% | 1.48 | 39.64% | 1.63 | 38.92% | 17.71 | 30.12% | 0 | 3.76 |
DeepGram Aura-2 | Thalia-en | 3.45 | 29.28% | 21.41 | 18.75% | 23.73 | 21.14% | 54.49 | 33.81% | 1.24 | 48.21% | 1.36 | 43.70% | 16.83 | 32.44% | 4 | 4.33 |
11Labs eleven multilingual v2 | Brian | 0.63 | 30.35% | 14.44 | 35.53% | 21.51 | 45.53% | 31.44 | 14.48% | 0.49 | 39.46% | 1.15 | 35.53% | 11.19 | 33.89% | 0 | 3.55 |
HumeAI | - | 0.83 | 61.60% | 21.05 | 34.64% | 19.84 | 36.91% | 37.14 | 34.28% | 0.38 | 43.21% | 0.93 | 44.64% | 12.85 | 42.73% | 1 | 4.18 |
gpt-4o-mini-tts (SP) | Alloy | 0.71 | 59.17% | 12.07 | 57.32% | 21.33 | 58.75% | 31.57 | 52.44% | 0.66 | 52.67% | 0.84 | 57.14% | 10.76 | 56.32% | 2 | 4.20
gpt-4o-mini-audio-preview | Alloy | 0.95 | 55.89% | 14.48 | 59.82% | 19.04 | 52.86% | 32.27 | 30.61% | 0.55 | 47.32% | 0.88 | 48.75% | 10.92 | 49.60% | 1 | 4.19 |
gpt-4o-mini-audio-preview (SP) | Alloy | 9.34 | 59.13% | 12.70 | 58.92% | 20.92 | 62.59% | 37.14 | 28.68% | 0.74 | 48.21% | 0.72 | 53.40% | 13.09 | 52.31% | 5 | 4.18
gpt-4o-audio-preview | Alloy | 1.03 | 48.57% | 14.72 | 60.17% | 23.16 | 66.78% | 35.89 | 40.81% | 1.19 | 47.5% | 1.25 | 57.14% | 12.38 | 53.76% | 0 | 4.09 |
gpt-4o-audio-preview (SP) | Alloy | 0.93 | 61.64% | 13.75 | 62.5% | 20.56 | 68.21% | 36.92 | 49.59% | 1.72 | 47.85% | 1.26 | 56.85% | 12.00 | 57.95% | 4 | 4.06
gpt-4o-audio-preview | Ballad | 1.82 | 88.84% | 13.30 | 60.17% | 21.15 | 82.14% | 35.32 | 40.40% | 1.38 | 56.96% | 1.16 | 59.53% | 11.87 | 65.17% | 4 | 3.83 |

4.3 Sensitivity to Judge
While Gemini 2.5 Pro achieves the highest performance on the MMAU mmaumassiv benchmark for audio understanding, we conducted an ablation study to assess how evaluation outcomes vary across different LALM judger models, both proprietary and open-source. Using identical audio inputs from candidate TTS systems, we varied the judger model across four closed-source and one open-source alternative. Results are shown in Table 2.
Our analysis reveals that Qwen 2.5 Omni performs poorly in the judging role, frequently producing parsing errors and yielding win-rates near 50% across the board, indicative of near-random behavior. In contrast, the remaining judger models (Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, GPT-4o-mini-audio, and GPT-4o-audio) exhibit strong agreement in their relative rankings, despite differences in absolute scores. This alignment is quantified by a high Kendall’s W coefficient of concordance, indicating near-perfect inter-model consistency and further validating the robustness of our evaluation framework.
Judger Model | Gemini 2.0 Flash | Gemini 2.5 Flash | Gpt-4o-mini-audio | Gpt-4o-audio | Qwen 2.5 Omni | |||||
Evaluated Model | Win-Rate | Parsing Fail | Win-Rate | Parsing Fail | Win-Rate | Parsing Fail | Win-Rate | Parsing Fail | Win-Rate | Parsing Fail |
Sesame1B | 25.30% | 3 | 24.77% | 6 | 28.60% | 2 | 31.07% | 2 | 41.39% | 76 |
Qwen2.5 Omni Chelsie | 38.06% | 3 | 31.49% | 8 | 42.67% | 1 | 38.13% | 2 | 47.12% | 82 |
Qwen2.5 Omni Chelsie (SP) | 39.17% | 0 | 32.09% | 6 | 43.09% | 2 | 39.38% | 1 | 47.41% | 77
Orpheus-TTS Tara | 39.41% | 1 | 38.02% | 4 | 41.03% | 0 | 41.33% | 1 | 48.59% | 74 |
DeepGram Thalia | 40.79% | 0 | 36.27% | 2 | 43.10% | 0 | 37.26% | 0 | 47.84% | 70 |
ElevenLabs Brian | 44.79% | 1 | 41.14% | 0 | 48.93% | 1 | 44.22% | 0 | 48.98% | 67 |
Hume.AI | 47.99% | 1 | 40.34% | 3 | 46.20% | 0 | 47.20% | 1 | 49.42% | 76 |
gpt-4o-mini-tts Alloy | 54.43% | 0 | 53.43% | 0 | 52.06% | 0 | 51.51% | 1 | 50.31% | 63 |
gpt-4o-mini-audio-preview Alloy | 48.08% | 0 | 47.14% | 0 | 48.63% | 0 | 48.72% | 1 | 50.28% | 71 |
gpt-4o-mini-audio-preview Alloy (SP) | 51.18% | 1 | 49.57% | 0 | 47.29% | 1 | 50.12% | 0 | 49.10% | 73
gpt-4o-audio Alloy | 53.28% | 0 | 53.65% | 2 | 50.39% | 0 | 53.03% | 0 | 49.71% | 81 |
gpt-4o-audio Alloy (SP) | 54.98% | 1 | 57.06% | 3 | 50.54% | 0 | 54.74% | 0 | 50.69% | 73
gpt-4o-audio Ballad | 58.78% | 1 | 57.60% | 1 | 55.80% | 1 | 64.23% | 1 | 49.30% | 68 |
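The inter-judge concordance reported above can be reproduced from per-judge rankings of the evaluated models (e.g., derived from the win-rate columns of Table 2); the snippet below is a re-implementation of the standard Kendall's W formula rather than our exact analysis script, and the example uses a subset of the models and judges from Table 2.

```python
import numpy as np

def kendalls_w(rankings: np.ndarray) -> float:
    """rankings: (num_judges, num_models) array of ranks (1 = best) that each
    judge assigns to the evaluated models. Returns Kendall's W in [0, 1]."""
    m, n = rankings.shape                    # m judges, n models
    rank_sums = rankings.sum(axis=0)         # total rank per model
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Example: win-rates for five models under three judges (values from Table 2).
win_rates = np.array([
    [25.30, 38.06, 39.41, 44.79, 54.43],   # Gemini 2.0 Flash
    [24.77, 31.49, 38.02, 41.14, 53.43],   # Gemini 2.5 Flash
    [28.60, 42.67, 41.03, 48.93, 52.06],   # GPT-4o-mini-audio
])
ranks = (-win_rates).argsort(axis=1).argsort(axis=1) + 1   # rank 1 = highest win-rate
print(kendalls_w(ranks))
```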
4.4 Understanding Bias of Specific Voices
Most TTS models are tied to specific voices, either through fine-tuning or voice cloning, except for a few, such as Hume.AI and Sesame1B, which generate a different voice for each utterance. To examine the impact of voice identity on performance, we measure the category-wise standard deviation in win-rate across multiple voices for four models: GPT-4o-mini-tts (6 voices: Alloy, Ballad, Ash, Coral, Nova, Onyx), Deepgram Aura-2 (6 voices: Thalia-en, Andromeda-en, Helena-en, Apollo-en, Arcas-en, Aries-en), Orpheus-TTS (6 voices: Tara, Leah, Jess, Leo, Dan, Mia), and Qwen 2.5 Omni (2 voices: Chelsie and Ethan). Results are shown in Figure 4(a). We find that Emotions and Paralinguistics exhibit the highest sensitivity to voice variation, reflected in elevated standard deviations. This is consistent with the fact that voice fine-tuning often emphasizes expressive rendering, which these categories demand. In contrast, Complex Pronunciation shows the least variance across voices, as it depends more on the inherent ability of the model than on the characteristics of a particular voice; the remaining categories also show generally low variance.
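The voice-sensitivity measurement amounts to the per-category standard deviation of win-rate across a model's available voices; a minimal sketch with an assumed data layout is given below.

```python
import numpy as np

def voice_sensitivity(win_rates_by_voice: dict) -> dict:
    """win_rates_by_voice maps voice name -> {category: win-rate in %}.
    Returns, per category, the standard deviation of win-rate across voices,
    as plotted in Figure 4(a)."""
    categories = next(iter(win_rates_by_voice.values())).keys()
    return {
        cat: float(np.std([per_voice[cat] for per_voice in win_rates_by_voice.values()]))
        for cat in categories
    }
```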


4.5 Text Normalization
The main challenge of the Complex Pronunciation category lies in parsing uncommon characters and character groups, which can be made easier by applying Text Normalization (TN) techniques before sending the text to the TTS model. To this end, we run an ablation measuring the change in win-rate for various TN techniques, including a data point where an LLM (GPT-4.1-mini) acts as the text normalizer; the results are in Table 3(a).
Text Normalization Method | Win-rate |
No TN | 51.69% |
WeText TN | 50.06% |
GPT-4.1-mini TN | 76.74% |
Model Judge | Spearman Correlation
Gemini 2.5 Pro | 90.5% |
Gemini 2.0 Flash | 90.5% |
Gemini 2.5 Flash | 90.5% |
GPT-4o-audio | 90.5% |
Qwen 2.5 Omni | 88.1% |
GPT-4o-mini-audio | 76.2% |
We note that basic TN techniques do not always improve model performance on our benchmark and can make it worse. For instance, WeText (wetext-tn, ) converts '$1,890.125375' into 'one thousand eight hundred and ninety point one dollars twenty five thousand three hundred and seventy five', which harms TTS quality. Similarly, '0' is sometimes normalized to the informal 'oh', which is not preferred in formal or decimal contexts, and 'SQL' was correctly normalized to 'S Q L', although the baseline's pronunciation 'Sequel' was preferred. Using an LLM for TN resolves many of these issues and significantly improves the win-rate, though some errors persist with the basic prompt that we used. We include more examples in Appendix E.
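A sketch of the LLM-as-normalizer setup used in this ablation is shown below; `llm_complete` is a placeholder for the GPT-4.1-mini call, and the prompt text is paraphrased (the actual prompt is reproduced in Appendix E).

```python
def normalize_with_llm(utterance: str, llm_complete) -> str:
    """Ask an LLM to spell out numerals, currencies, URLs, abbreviations, etc.
    before the text is sent to the TTS model. `llm_complete` is a placeholder
    for a chat-completion call; the real prompt is given in Appendix E."""
    prompt = (
        "Rewrite the following text so that a text-to-speech system can read it "
        "aloud naturally: expand numbers, currencies, dates, URLs, and abbreviations "
        "into words while preserving the original meaning.\n\n" + utterance
    )
    return llm_complete(prompt)
```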
4.6 Human-Model Alignment
4.6.1 Human Study Setup
We conducted a human evaluation to measure the correlation between the model judges’ preferences and those of human judges. We created an online survey using Gradio, where human judges were presented with pairs of audio clips generated by a baseline TTS and a comparison TTS and instructed to rate which one is better (or declare a tie). To ensure consistency in evaluation, participants were given instructions and evaluation criteria adapted from the prompts used for the model judges. The human preferences were then aggregated to compute the win-rate of each comparison model against the baseline, which was compared to the win-rates produced by model judges. For this study, we selected gpt-4o-mini-tts as the baseline and compared it against eight other models: Sesame1B, Deepgram, ElevenLabs, gpt-4o-mini-audio-preview, gpt-4o-mini-tts (SP), Hume AI, Orpheus-TTS Tara, and Qwen2.5-Omni Chelsie. These comparisons were evaluated by the following model judges: Gemini 2.5 Pro, Gemini 2.0 Flash, Gemini 2.5 Flash, GPT-4o-mini-audio, GPT-4o-audio, and Qwen 2.5 Omni.
A total of 512 audio pairs were sampled from these comparisons to ensure coverage across different categories and refinement depths. These were distributed among human judges, with each judge assigned 149 or 150 audio pairs and with some redundancy among the judges.
4.6.2 Agreement Between Human and Model Judgements
To evaluate alignment between human and model judgments, we computed the Spearman correlation between the comparison models’ rankings based on win-rates derived from human ratings and those derived from each model judge. As shown in Table 3(b), all judges achieved high correlation scores, suggesting that model judges closely mirror human preferences in determining which TTS system performs better. We also analyzed the individual win-rates of each comparison model (vs. the baseline) under both human and model evaluations. As shown in Figure 4(b), most model win-rates are closely aligned with human judgment (within the 95% CI), though discrepancies exist in some cases (e.g., Hume AI, Sesame1B), where the model (Gemini 2.5 Pro) over-estimates performance compared to human preference.
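The ranking agreement in Table 3(b) corresponds to a Spearman correlation over the comparison models' win-rates under human versus model judging; a minimal sketch is shown below.

```python
from scipy.stats import spearmanr

def ranking_agreement(human_win_rates, judge_win_rates) -> float:
    """Spearman correlation between the comparison models' win-rates (vs. the same
    baseline) as measured by human raters and by a given LALM judge."""
    rho, _ = spearmanr(human_win_rates, judge_win_rates)
    return float(rho)
```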
5 Limitations and Conclusion
There are two main limitations to our work, related to dataset creation and the LALM-as-judge paradigm. First, the LLMs used for data generation have inherent biases that may manifest in our synthetic dataset, such as preferences for literary language and formal phrasing patterns. For categories like Foreign Words and Syntactic Complexity, the deepest refinement level (depth 3) produces grammatically correct but somewhat artificial utterances that occur infrequently in natural communication, but they still act as a solid stress-test for TTS systems. Additionally, our multilingual evaluation focuses on Latin transcriptions rather than native character sets, which does not fully capture the challenges of true multilingual TTS. Regarding evaluation, using Gemini 2.5 Pro incurs substantial costs of approximately $50 per complete TTS-system evaluation. Nevertheless, the strong ranking agreement observed across different judge models suggests opportunities for more economical alternatives without significant quality loss. We also observe that evaluating subjective aspects like emotions, prosody, and intonation can occasionally lead to LALM hallucinations, where judges incorrectly identify pronunciation issues. Despite these considerations, EmergentTTS-Eval represents a significant advancement in TTS evaluation methodology by addressing critical gaps in existing benchmarks. Our approach systematically challenges TTS systems across dimensions that conventional metrics overlook, while offering a scalable alternative to resource-intensive human evaluations. The strong correlation between our LALM judges and human preferences validates the approach, while the benchmark’s ability to reveal fine-grained performance differences demonstrates its practical utility for driving progress toward more human-like synthetic speech.
6 Broader Impacts
EmergentTTS-Eval aims to accelerate the development of more expressive, accurate, and inclusive TTS systems, which can greatly benefit accessibility tools and enable more natural conversational interfaces across a variety of applications. However, highly convincing TTS systems could be used to perpetrate fraud or spread disinformation, and LALM judges may perpetuate biases. To mitigate these risks, we encourage pairing TTS systems with deepfake detectors or watermarking and auditing prompt and judge outputs for diverse representation.
References
- (1) MiniCPM-o 2.6. https://huggingface.co/openbmb/MiniCPM-o-2_6.
- (2) suno-ai/bark. https://github.com/suno-ai/bark.
- (3) tortoise-tts. https://github.com/neonbjb/tortoise-tts.
- (4) wenet-e2e/WeTextProcessing. https://github.com/wenet-e2e/WeTextProcessing.
- (5) Zyphra/Zonos-v0.1. https://huggingface.co/Zyphra/Zonos-v0.1-transformer.
- (6) Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, and Xiaobin Zhuang. Seed-tts: A family of high-quality versatile speech generation models, 2024.
- (7) Pavel Andreev, Aibek Alanov, Oleg Ivanov, and Dmitry Vetrov. Hifi++: A unified framework for bandwidth extension and speech enhancement. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 1–5. IEEE, June 2023.
- (8) Huda Barakat, Oytun Turk, and Cenk Demiroglu. Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources. EURASIP Journal on Audio, Speech, and Music Processing, 2024(1), February 2024.
- (9) CanopyAI. Orpheus-tts. https://github.com/canopyai/Orpheus-TTS, 2025.
- (10) Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, and Eng Siong Chng. Audio large language models can be descriptive speech quality evaluators. In ICLR, 2025.
- (11) Hyunjae Cho, Wonbin Jung, Junhyeok Lee, and Sang Hoon Woo. SANE-TTS: Stable and natural end-to-end multilingual text-to-speech. In Interspeech 2022, page 1–5. ISCA, September 2022.
- (12) Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models, 2023.
- (13) DeepSeek-AI. Deepseek-R1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
- (14) Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024.
- (15) Shujie Hu, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, and Furu Wei. Wavllm: Towards robust and adaptive speech large language model, 2024.
- (16) Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, and Yu Tsao. The voicemos challenge 2024: Beyond speech quality prediction, 2024.
- (17) Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, and Zhou Zhao. Wavreward: Spoken dialogue models with generalist reward evaluators, 2025.
- (18) Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent Van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, et al. BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data. arXiv preprint arXiv:2402.08093, 2024.
- (19) Javier Latorre, Kayoko Yanagisawa, Vincent Wan, Balakrishna Kolluru, and M.J.F. Gales. Speech intonation for tts: Study on evaluation methodology. 09 2014.
- (20) Jianli Liu and Jinying Chen. The Application of Speech Synthesis in Car Warning System, pages 657–662. 10 2014.
- (21) Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, and Yu Wu. Inference-time scaling for generalist reward modeling, 2025.
- (22) Haowei Lou, Hye young Paik, Sheng Li, Wen Hu, and Lina Yao. Generalized multilingual text-to-speech generation with language-aware style adaptation, 2025.
- (23) Georgia Maniati, Alexandra Vioni, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, and Pirros Tsiakoulis. Somos: The samsung open mos dataset for the evaluation of neural text-to-speech synthesis. In Interspeech 2022. ISCA, September 2022.
- (24) Bolaji Oladokun, Rexwhite Enakrire, Wole Olatokun, and Yusuf Ajani. Youth disability: Adopting text-to-speech and screen reader technologies for library and information services. 04 2024.
- (25) Bryan Patrick, Kaisha Stone, and Tommy Fred. Applications of voice quality technology in virtual assistants and call centers. 12 2024.
- (26) Charuta Pethe, Bach Pham, Felix D Childress, Yunting Yin, and Steven Skiena. Prosody analysis of audiobooks, 2025.
- (27) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022.
- (28) Sandra Raffoul and Lindsey Jaber. Text-to-speech software and reading comprehension: The impact for students with learning disabilities. Canadian Journal of Learning and Technology, 49(2):1–18, November 2023.
- (29) Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, and Christian Frank. Audiopalm: A large language model that can speak and listen, 2023.
- (30) S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark, 2024.
- (31) S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. MMAU: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168, 2024.
- (32) Thibault Sellam, Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa. Squid: Measuring speech naturalness in many languages. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), page 1–5. IEEE, June 2023.
- (33) Sesame. Csm 1b. https://huggingface.co/sesame/csm-1b, 2025.
- (34) Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models, 2024.
- (35) Gemini Team. Gemini: A family of highly capable multimodal models, 2025.
- (36) Attila Vékony. Speech recognition challenges in the car navigation industry. pages 26–40, 08 2016.
- (37) Brendan Walsh, Mark Hamilton, Greg Newby, Xi Wang, Serena Ruan, Sheng Zhao, Lei He, Shaofei Zhang, Eric Dettinger, William T. Freeman, and Markus Weimer. Large-scale automatic audiobook creation, 2023.
- (38) Siyang Wang and Éva Székely. Evaluating text-to-speech synthesis from a large discrete token-based speech language model, 2024.
- (39) Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, and Chao Zhang. Enabling auditory large language models for automatic speech quality evaluation, 2025.
- (40) Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, and Wei Xue. Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens, 2025.
- (41) Kangxiang Xia, Dake Guo, Jixun Yao, Liumeng Xue, Hanzhao Li, Shuai Wang, Zhao Guo, Lei Xie, Qingqing Zhang, Lei Luo, Minghui Dong, and Peng Sun. The iscslp 2024 conversational voice clone (covoc) challenge: Tasks, results and findings, 2024.
- (42) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- (43) Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report, 2025.
APPENDIX
The appendix is organized as follows:
Section A: Prompts and details for breadth and depth refinement of each category, along with final dataset statistics.
Section B: MMAU benchmarking of LALMs.
Section C: Evaluation related details, such as hyperparameters, audio generation prompts, and prompts for the judger.
Section D: Analysis of Gemini-2.5-Pro as a judger and the case of Audio Subjectivity.
Section E: Text Normalization prompt and examples.
Appendix A Per-Category Depth and Breadth Refinement
For breadth expansion, we leverage long-thinking LLMs: Gemini 2.5 Pro, GPT-o3, and Claude 3.7 Sonnet. We prompt all three with the same breadth prompt and, through manual analysis, select the LLM that produces the best breadth expansion; the chosen LLM is reported for each category below. All depth refinements are achieved using Gemini-2.5-Pro. We create the following template, which is populated with the **text_to_synthesize** and the category-specific refinement method **complex_prompt_method**. The depth-refinement prompt requires a field tts_synthesis_diversity in the LLM output; this is the field where CoT specifications are provided for each category to ensure high-quality, diverse, and complex refinements from the LLM. The template is:
In the following sections, we present the category-wise breadth expansion prompts and the complex_prompt_method used for each category.
A.1 Category 1: Questions

Breadth Expansion
We use 20 samples that were proposed for this category in BASE-TTS, and then prompt Gemini-2.5-Pro to curate 50 more samples that achieve a significantly wider coverage for evaluating TTS interrogative prosody. Beyond standard Yes/No and Wh- questions, we incorporate negative questions, rhetorical questions, declarative questions expressing surprise or doubt, hypotheticals, alternative questions involving lists, and questions featuring parenthetical elements or starting conjunctions, to get a total of 70 samples with richer syntactic diversity and broader prosodic demands. The breadth expansion prompt used is:
Depth Refinement
To move beyond simple questions, we utilize the depth refinement strategy to generate utterances demanding highly varied interrogative and declarative prosody from the TTS system. We refine iteratively using one of three methods: appending a sequential question, appending a statement and then a question, or infusing pragmatic nuance before appending a question. The resulting dataset is designed to specifically evaluate a TTS system’s proficiency in: (a) rendering natural pitch contours across consecutive questions, (b) executing smooth prosodic transitions between declarative and interrogative speech within one utterance, and (c) conveying subtle pragmatic meanings (like skepticism or politeness) through appropriate intonational variation alongside the core question structure. Refer to Figure 5 for an example refinement and the depth-refinement prompt is as follows:
A.2 Category 2: Foreign Words

Breadth Expansion
The initial set of 20 BASE-TTS samples showed several constraints: they primarily featured European languages and frequently used easy-to-pronounce loanwords as the foreign word. To improve variety, we employed GPT-o3 to create an additional 50 samples. This expansion significantly enhances language diversity by incorporating more samples from the 10 most spoken languages around the world (Mandarin Chinese, Hindi, Spanish, French, Arabic, Russian, Portuguese, Japanese, German, and Indonesian), with particular emphasis on uncommon foreign vocabulary that presents pronunciation challenges. All non-English words appear exclusively in standard Latin characters without any special foreign alphabetic symbols; this design choice is intentional and follows our goal of testing the emergent capabilities of TTS systems: while multilingual training data in native scripts may be widely available, such data restricted to Latin-only transliteration is very limited or non-existent. The breadth expansion prompt used is as follows:
Depth Refinement
While evaluating TTS on isolated foreign words tests basic pronunciation, it does not reflect the full complexity of natural bilingual communication, which frequently integrates longer foreign phrases; we fill this gap through the depth-refinement strategy. This method systematically transforms simpler utterances into variants featuring more substantial foreign-language segments, mimicking how bilingual individuals speak and write. Guided by a specific prompt, an LLM applies one of three approaches: (i) replacing isolated words with idiomatic phrases, (ii) expanding short phrases by absorbing and translating adjacent English context, or (iii) appending bilingual affixes to already complex sentences. This yields utterances requiring the TTS system to manage fluent code-switching and natural prosody over extended foreign elements, providing a more rigorous assessment of its capabilities. This refinement strategy resulted in complex code-switching but, due to grammatical differences between languages, often created awkward sentences by the final refinement. To remedy this, we post-process the output of each refinement step through a separate LLM call to Gemini-2.5-Pro to fix any grammatical and syntactic issues, which we found to be quite effective. Figure 6 illustrates this process, and the depth-refinement prompt is as follows:
We also share the system message used for the post-processing step of fixing grammatical awkwardness with an LLM; the text_to_synthesize is provided as the user message.
A.3 Category 3: Paralinguistics

Breadth Expansion
The initial set of 20 samples from BASE-TTS provided foundational coverage of common paralinguistic phenomena, including basic interjections (e.g., ‘Aha!’, ‘Oops’), simple vocal sounds (‘Yawn’, ‘psst’), and common hesitations (‘Uh’, ‘Hmm’). However, this initial set lacked significant diversity. It underrepresented a wider spectrum of emotional interjections, varied onomatopoeia (spanning bodily, environmental, and animal sounds), nuanced textual emphasis cues, explicit pacing markers, and complex speech disfluencies. The breadth-expanded set with 50 additional samples significantly addressed these gaps by incorporating 5 types of cues such as: (i) additional interjections (e.g., Eww, Gasp, Tsk tsk, Oy vey), (ii) diverse onomatopoeia (e.g., Achoo, Tick-tock, Pitter-patter), (iii) varied emphasis markers (e.g., capitalization like REALLY, vowel elongation like sooooo, hyphenation like ab-so-lutely or Un-der-STAND), (iv) pacing cues via ellipses (…) or punctuation (STOP. RIGHT. THERE.), and (v) stuttering representations (e.g., I-I, N-N-NO). The breadth-expansion is achieved using Claude 3.7 Sonnet using the following prompt:
Depth Refinement
To obtain a representative set of paralinguistic cues that occur in written dialogue aiming to convey expressive speech, such as found in scripts, fictional narratives, and certain forms of informal communication, we apply the depth-refinement strategy. The refinement prompt uses the 5 defined types of paralinguistic cues and rewrites the text by incrementally adding one or two more cues of any type at each step. By the final refinement step, this process yields texts with multiple distinct paralinguistic cues, designed to work together to create a unique and realistic challenge for the TTS system. An example refinement is shown in Figure 7, and the prompt used is:
A.4 Category 4: Emotions

Breadth Expansion
The initial dataset of 20 samples from BASE-TTS, while covering foundational emotions like joy, anger, and sadness, exhibited limitations in emotional granularity and contextual depth necessary for comprehensive TTS evaluation. It predominantly featured strong, primary emotions and lacked sufficient diversity in more nuanced states such as sarcasm, envy, resignation, or complex blends like bitter-sweetness. To address these gaps, an additional set of 50 samples was curated using Claude 3.7 Sonnet, specifically designed to significantly expand the emotional palette and sentence structural variety. This augmented set incorporates a wider spectrum of subtle and complex affective states, embedded within richer narrative contexts that provide stronger implicit cues for the target prosodic realization. The resulting 70-sample dataset thus offers enhanced evaluative robustness, enabling a more rigorous assessment of a TTS system’s ability to synthesize expressive dialogues. The prompt used for breadth expansion is:
Depth Refinement
We leverage the depth refinements to test TTS systems on more than just producing single, unchanging emotions. This approach checks two key things: first, how well the system changes its emotional expression when the text suggests a shift (like moving from happy to sad), and second, how realistically it can keep an emotion going or make it stronger within a single piece of dialogue, just like people do when they speak naturally. We refine the base samples to introduce increased complexity, primarily through two mechanisms: either incorporating a distinct, contrasting emotional state, often signaled via brief preceding or subsequent narrative cues, or deepening and intensifying an existing emotion within a specific dialogue segment, thereby extending the utterance. Emphasis was placed on ensuring the plausibility of these emotional arcs through natural narrative flow, and on matching the existing language style of the text so that overly formal language is not introduced where it does not fit. Refer to the refinements in Figure 8; the prompt used is:
A.5 Category 5: Syntactic Complexity

Breadth Expansion
The initial 20 samples effectively tested TTS prosody on deep center-embedding and long subject-verb dependencies but notably omitted other crucial structures reliant on prosody, such as inversion, cleft sentences, ellipsis (gapping), complex clausal subjects (Wh-/That-/gerunds), and nuanced punctuation cues (semicolons, dashes). The 50 samples curated through breadth expansion using Gemini-2.5-pro rectify these specific omissions by introducing robust examples across these categories, significantly broadening the structural diversity and creating a more comprehensive benchmark for evaluating a TTS system’s handling of complex syntax. The resulting 70-sample dataset therefore provides a more robust and syntactically varied test suite for assessing the prosodic competence of TTS systems when faced with intricate grammatical structures. The breadth expansion prompt is as follows:
Depth Refinement
While breadth-wise expansion ensures coverage across diverse syntactic phenomena, depth-wise refinement is crucial for assessing a TTS system's robustness and performance scalability under escalating grammatical intricacy. This approach tests the system's ability to manage compounded syntactic load and maintain prosodic coherence under increasing structural demands, rather than merely handling isolated complexities. Our refinement strategy iteratively enhances base complex sentences by applying targeted syntactic transformations, such as introducing complex coordination, structural reordering (e.g., fronting, or passivization that affects dependencies), complicating ellipsis, or adding layered subordination, while strictly enforcing constraints on grammatical correctness and naturalness. The refinement is also encouraged to add two words that are homographs of each other if they fit naturally in the context; this tests the ability of the TTS system to disambiguate meaning from context and pronounce each occurrence correctly. The resulting dataset provides a graded challenge, enabling evaluation of how TTS prosody adapts to and conveys meaning across incrementally complex, yet plausible, sentence structures. Refer to Figure 9 for an example refinement; the prompt used is:
A.6 Category 6: Complex Pronunciation

Breadth Expansion
We create this category from scratch by prompting Claude 3.7 Sonnet to generate 60 samples, 10 from each of the following six sub-categories: (i) Numerals and Currencies, (ii) Dates and Times, (iii) Emails, URLs and Passwords, (iv) Addresses and Location References, (v) STEM Notations and Equations, and (vi) Mixed Acronyms/Initialisms. In addition, we add 5 short tongue-twisters that are repeated many times; for example, the "The Sixth Sick Sheikh's Sixth Sheep's Sick" tongue-twister repeated 6 times. The prompt used is:
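For the tongue-twister portion, the repeated samples can be built with a one-line helper; the sketch below is illustrative and uses the twister cited above.

# Constructing a repeated tongue-twister sample (illustrative helper).
def repeated_twister(twister: str, repetitions: int) -> str:
    return " ".join([twister] * repetitions)

sample = repeated_twister("The Sixth Sick Sheikh's Sixth Sheep's Sick.", 6)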
Depth Refinement
In this category, the goal of depth refinement is to progressively add complex, hard-to-pronounce elements to an utterance while keeping them within the same sub-category as the original. With each refinement, we aim to increase the density of such elements. To avoid repeating the same pronunciation challenges, we prompt the refining LLM to suggest three novel ways to introduce complexity, distinct from what is already present. Specifically, we ask for elements that are likely to challenge TTS systems even if the other parts are rendered correctly. This approach consistently produces utterances with multiple challenging components that TTS systems may struggle to synthesize. We do not apply this strategy to the tongue-twisters and keep them as is. An example refinement is given in Figure 10, and the prompt used is:
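The depth-refinement request for this category can be sketched as a prompt-building helper; the wording below is an illustrative paraphrase of the instructions described above, not the actual prompt.

# Illustrative construction of the complex-pronunciation refinement request.
def build_pronunciation_refinement_prompt(utterance: str, sub_category: str) -> str:
    return (
        f"The utterance below belongs to the sub-category '{sub_category}'.\n"
        "1. Suggest three novel hard-to-pronounce elements from the same sub-category "
        "that are not already present and that are likely to challenge a TTS system.\n"
        "2. Rewrite the utterance, weaving in one or more of those elements so the "
        "density of challenging content increases.\n\n"
        f"Utterance:\n{utterance}"
    )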
A.7 Final dataset statistics
Refer to Table 4 for final category-wise statistics.
Category | No. Characters (Min / Avg / Max) | No. Words (Min / Avg / Max) | Audio Length (s) (Min / Avg / Max)
Questions | 16 / 248.22 / 701 | 3 / 41.61 / 120 | 0.90 / 15.04 / 48.45
Foreign Words | 71 / 136.85 / 242 | 9 / 21.77 / 39 | 4.80 / 9.07 / 16.20
Paralinguistics | 28 / 127.36 / 319 | 5 / 19.30 / 49 | 2.15 / 9.23 / 22.30
Emotions | 102 / 340.04 / 676 | 18 / 57.58 / 107 | 6.15 / 21.80 / 45.20
Syntactic Complexity | 45 / 194.71 / 366 | 8 / 28.23 / 64 | 3.25 / 12.53 / 23.60
Complex Pronunciation | 104 / 260.35 / 920 | 8 / 35.28 / 139 | 8.45 / 25.56 / 94.70
Overall | 16 / 217.02 / 920 | 3 / 33.93 / 139 | 0.90 / 15.32 / 94.70
Appendix B MMAU performance for LALMs
MMAU [31] is an audio-reasoning benchmark that tests audio understanding models across three categories: Speech, Music, and Sounds. It evaluates 27 distinct skills through information-extraction and reasoning tasks that demand advanced capabilities, such as multi-speaker role mapping, emotional shift detection, and temporal acoustic event analysis. We run the evaluation ourselves on the test-mini subset of 1,000 samples for the top closed-source LALMs; the results are summarized in Table 5.
Model | Test-mini score |
Gemini 2.5 Pro | 68.60 |
Gemini 2.5 Flash | 65.20 |
Gemini 2.0 Flash | 62.10 |
GPT-4o-audio | 59.20 |
GPT-4o-mini-audio | 59.80 |
Appendix C Evaluation-related Details
C.1 Hyper-parameters
Data Depth Refinement:
Gemini 2.5 Pro is prompted with a non-zero sampling temperature to encourage creativity, and depth refinement is run for a fixed number of steps per sample.
Audio Generation:
Closed-source TTS models like Aura-2, Eleven Multilingual v2, HumeAI and gpt-4o-mini-tts do not expose a temperature parameter. For Sesame1B, Qwen2.5 Omni, gpt-4o-mini-audio-preview and gpt-4o-audio-preview we use a fixed sampling temperature, and for Orpheus TTS we use the recommended sampling values. We set the maximum output tokens high enough to ensure that no system's audio is clipped. However, for MiniCPM the audio is automatically clipped at 44 seconds, and there does not appear to be an exposed parameter to extend this limit. Suno-ai's Bark only performs well up to roughly 13 seconds; the official repository provides a recipe for generating longer audio by splitting the input into multiple sentences and concatenating the outputs, which we adopt. Tortoise-TTS audio is clipped at around 27 seconds, even after setting max_mel_tokens to its maximum value of 600. For Zyphra's Zonos model, we follow the inference code from the GitHub repository and use assets/exampleaudio.mp3 as the prompt audio for voice cloning.
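For reference, the Bark long-form recipe we adopt amounts to splitting the input into sentences, synthesizing each chunk, and concatenating the audio. The sketch below follows that recipe in spirit; the speaker preset, the naive regex splitter, and the quarter-second pause are illustrative choices.

# Sketch of long-form generation with Bark via sentence splitting.
import re
import numpy as np
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()

def bark_long_form(text: str, speaker: str = "v2/en_speaker_6") -> np.ndarray:
    # naive sentence splitter; any sentence tokenizer would do here
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    silence = np.zeros(int(0.25 * SAMPLE_RATE))  # short pause between chunks
    pieces = []
    for sentence in sentences:
        pieces.append(generate_audio(sentence, history_prompt=speaker))
        pieces.append(silence)
    return np.concatenate(pieces)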
Judger LALMs:
For the judger, we use a fixed sampling temperature for reproducibility. We find that while Gemini results are not fully deterministic even at a fixed temperature, the final win-rate does not change significantly across runs. The maximum output length is set separately for the thinking models (Gemini 2.5 series) and for the other judgers used in the ablation.
C.2 Prompts for Audio Generation
From the description map presented below, we select and send the description relevant to the specific category when using Strong Prompting with Hume AI and gpt-4o-mini-tts:
Following are the templates we use for normal prompting and strong prompting scenarios with LALMs like Qwen 2.5 Omni, gpt-4o-mini-audio-preview and gpt-4o-audio-preview.
Normal Prompt:
Strong Prompt:
The {{{descriptions}}} placeholder is replaced with the specific description of that category, as mentioned in the ALL_DESCRIPTIONS map.
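In code, this substitution is a simple lookup and template fill. The sketch below is illustrative: the template text and the ALL_DESCRIPTIONS entries shown are stand-ins for the actual map and prompts.

# Illustrative strong-prompt construction via the category description map.
ALL_DESCRIPTIONS = {
    "Emotions": "Convey the emotions implied by the narrative context ...",
    "Paralinguistics": "Realize interjections, emphasis, and pacing cues ...",
    # ... one entry per category
}

STRONG_PROMPT_TEMPLATE = "Read the following text aloud. {descriptions}\n\nText: {text}"

def build_strong_prompt(category: str, text: str) -> str:
    return STRONG_PROMPT_TEMPLATE.format(
        descriptions=ALL_DESCRIPTIONS[category], text=text
    )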
C.3 Prompts for Judger and category-wise evaluation criteria
The judger is provided with the following prompt template:
After the above prompt, we append the audio from the first system, followed by the post-audio-1 prompt, before the second system's audio; this design choice provides effective separation between the two audios.
The placeholder {{{evaluation_criterion}}} is replaced with the specific criteria for that category, as described in the map below:
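Putting these pieces together, the judger call can be sketched as below, assuming the google-genai Python SDK. The file names, model string, and placeholder prompt strings are illustrative; the actual templates are the ones referenced above.

# Sketch of a pairwise judging call with interleaved prompts and audio.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

judger_prompt = "..."        # judger prompt template with the evaluation criterion filled in
post_audio_1_prompt = "..."  # separator prompt inserted after the first audio

with open("system_1.wav", "rb") as f1, open("system_2.wav", "rb") as f2:
    audio_1, audio_2 = f1.read(), f2.read()

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        judger_prompt,
        types.Part.from_bytes(data=audio_1, mime_type="audio/wav"),
        post_audio_1_prompt,
        types.Part.from_bytes(data=audio_2, mime_type="audio/wav"),
    ],
)
print(response.text)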
Appendix D Analysis of Gemini-2.5-Pro as a Judger and the case of Audio Subjectivity
In this section, we analyze Gemini-2.5-Pro’s behavior as a judge across categories, examining both its strengths in detecting nuanced differences and its limitations in subjective scenarios.
Questions:
Gemini demonstrates strong capability in recognizing intonation patterns, correctly identifying rising and falling contours in most cases. The tie-breaking procedure works effectively, with the judge appropriately preferring subtle prosodic advantages (e.g., choosing natural rising intonation over flat delivery when both systems score equally). However, occasional misclassifications occur where flat intonation is incorrectly perceived as rising/falling, or where tie-breaking is applied to equivalent performances. These edge cases often involve subjective interpretations of how preceding words contribute to overall interrogative prosody beyond just the final pre-question-mark intonation.
Emotions:
While Gemini consistently identifies intended emotions from textual context and reliably rewards systems with perceptible emotional variation (e.g., GPT-4o-audio Ballad vs. baseline), challenges emerge with emotionally flat outputs. In such cases, the judge occasionally hallucinates emotional expression where none exists. Additionally, as noted in Section 4.4, voice characteristics introduce systematic biases: high-pitched voices may advantage certain emotions while deeper voices favor others. Close comparisons in this inherently subjective category often depend on subtle interpretative judgments.
Foreign Words:
Gemini excels at phonemic analysis, providing evidence-based reasoning by correctly matching synthesized sounds to intended pronunciations. For clear cases (heavily anglicized vs. native pronunciation), performance is robust. Remarkably, the judge detects subtle phonemic distinctions, such as correctly identifying when the Spanish word "tocayo" is pronounced "toh-KAI-yoh" instead of "toh-KAH-yoh", a difference that sometimes requires multiple human listening passes. However, this sensitivity sometimes leads to over-emphasis of minor phonetic variations, resulting in tie-breaking or scoring differences for very similar pronunciations.
Paralinguistics:
The judge shows comprehensive understanding across all paralinguistic cues: interjections, onomatopoeia, emphasis markers, pacing cues, and stuttering. It accurately maps textual cues to vocal sounds, recognizing elongation, syllable stress, and capitalization-based emphasis. Fine-grained distinctions are captured, such as rewarding a crisp "Pssst" rendering over less precise vocalizations. However, subjectivity in duration judgments (e.g., the optimal length for "heyyyyyy") occasionally produces winner selections based on minimal temporal differences. Complex hyphenated emphasis like "TRU-ly ter-RI-ble" is also handled well: the judge penalizes systems that make strict word-splitting errors while rewarding renditions that ignore the hyphenation cues yet still produce a natural, appropriately stressed pronunciation.
Syntactic Complexity:
Gemini reliably focuses on the pausing and stress patterns crucial for syntactic disambiguation. Homograph resolution is particularly strong; for example, the judge correctly identified that ElevenLabs rendered "minute-by-minute" with inconsistent pronunciations ("my-NOOT" and "min-it") when both should be "min-it". Occasional errors involve misperceiving pause durations, either over- or under-estimating their length in the audio.
Complex Pronunciation:
In this category the judge exhibits negligible hallucination, leveraging Gemini's robust ASR capabilities. It provides detailed reasoning about which components are synthesized more accurately, enabling precise, fine-grained winner determination based on granular pronunciation analysis.
Subjectivity and Human Agreement:
Our human evaluation study yields a Krippendorff's alpha in the weak-to-moderate range of inter-annotator agreement, confirming the subjective nature of many TTS quality judgments. This level of agreement reflects the genuine difficulty humans face in consistently evaluating expressive speech synthesis.
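For reproducibility, agreement of this kind can be computed with the krippendorff package; the toy rating matrix below (annotators by items, with preferences coded as 1/0/-1 and missing ratings as NaN) is illustrative, not our annotation data.

# Sketch of computing Krippendorff's alpha over pairwise preference labels.
import numpy as np
import krippendorff

ratings = np.array([
    [1, 0, -1, 1, np.nan],   # annotator 1
    [1, 1, -1, 0, 0],        # annotator 2
    [0, 0, -1, 1, 1],        # annotator 3
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")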
Implications for Automated Evaluation:
Despite the observed limitations, Gemini-2.5-Pro's biases and occasional hallucinations are outweighed by crucial advantages for large-scale evaluation. Unlike human judges, the model provides consistent, reproducible assessments across thousands of samples, detailed timestamp-based reasoning, and scalable evaluation at a fraction of human annotation costs. The high Spearman correlation with human preferences and the strong inter-judge agreement across different LALMs (Kendall's tau) demonstrate that while individual judgments may not be perfect, the overall ranking and comparative analysis remain reliable and actionable for TTS system development.
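The two correlation statistics cited here are standard rank correlations and can be reproduced with SciPy; the win-rate lists below are illustrative placeholders, not our measured results.

# Sketch of the Spearman and Kendall correlations between judge and human rankings.
from scipy.stats import spearmanr, kendalltau

judge_win_rates = [0.72, 0.61, 0.55, 0.48, 0.30]   # per-system win-rates from the LALM judge
human_win_rates = [0.70, 0.64, 0.50, 0.47, 0.33]   # per-system win-rates from human annotators

rho, rho_p = spearmanr(judge_win_rates, human_win_rates)
tau, tau_p = kendalltau(judge_win_rates, human_win_rates)
print(f"Spearman rho={rho:.3f} (p={rho_p:.3f}); Kendall tau={tau:.3f} (p={tau_p:.3f})")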
Appendix E Text Normalization prompt and examples
Detailed below is the prompt we use when GPT-4.1-mini acts as the Text Normalizer for the results presented in Table 3(a).
In addition to the cases mentioned in Section 4.5, we present some additional cases illustrating the effect of WeText-TN [4] and GPT-4.1-mini-TN.
WeText-TN:
Worked Correctly: (i) a mixed fraction read as "two and one third percent"; (ii) a software version string read as "version v four point oh point one one seven oh rc point three b plus"; (iii) "2024-09-15" read as "fifteenth of September twenty twenty four".
Worked Incorrectly: (i) "Ste. 1250-B" was not expanded to "Suite"; (ii) "~" read as "tilde" instead of "approximately"; (iii) the exponent in a "10³" (per-mL) quantity was not read as "ten to the power of three"; (iv) "12/19/24-01/12/25" read as "twelve divided by nineteen divided by twenty..." instead of as a date range; (v) "CRISPR-Cas9" rendered as "CRISPR-CA's nine" instead of being pronounced as a word.
GPT-4.1-mini TN:
Worked Correctly: There are multiple cases across numerals, currencies, passwords, web addresses, etc., that work very well with this TN technique.
Worked Incorrectly: (i) a decimal number read as "one thousand seventy-five point zero zero five" instead of "one point zero seven five zero zero five"; (ii) "UTC+11:00" expanded to "Coordinated Universal Time plus 11 hours", which the judger does not prefer over "U T C plus eleven hours"; (iii) a currency amount read as "... dollars and twelve cents five three seven five", a misleading way to verbalize currency; (iv) a case of over-normalization; (v) many cases where abbreviations meant to be pronounced as a single word are spelled out letter by letter, and cases where the judger does not prefer an abbreviation being expanded.
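For completeness, a GPT-4.1-mini normalization call can be sketched with the OpenAI Python SDK as below. The system prompt shown is a stand-in for the actual normalization prompt referenced at the start of this appendix, and the example input is illustrative.

# Sketch of text normalization with GPT-4.1-mini before synthesis.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def normalize_text(raw_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system",
             "content": ("Rewrite the user's text so every symbol, number, abbreviation, "
                         "and URL is written out the way it should be spoken aloud. "
                         "Return only the rewritten text.")},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content.strip()

print(normalize_text("Meet at Ste. 1250-B by 09:30; budget is ~$1,075.005."))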