Code accompanying the paper, "EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge"
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. We introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences.
WER ↓ and Win-rate ↑ over all categories, with gpt-4o-mini-tts-alloy as the baseline and gemini-2.5-pro as the judger; ⭐ indicates that the model was strongly prompted (see the paper for the definition). For some expensive-to-use models, we report only the strongly prompted run.
Note
The generated audios for all of the models we evaluated, along with the predictions made by the gemini-2.5-pro judger, are available in this Google Drive link.
Model | Voice | Overall WER ↓ | Overall Win-Rate ↑ | Emotions WER ↓ | Emotions Win-Rate ↑ | Foreign Words WER ↓ | Foreign Words Win-Rate ↑ | Paralinguistics WER ↓ | Paralinguistics Win-Rate ↑ | Complex Pronunciation WER ↓ | Complex Pronunciation Win-Rate ↑ | Questions WER ↓ | Questions Win-Rate ↑ | Syntactic Complexity WER ↓ | Syntactic Complexity Win-Rate ↑ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gemini-2.5-Flash-Preview-TTS⭐ | Zephyr | 10.39 | 70.70% | 0.71 | 95.89% | 11.80 | 58.45% | 18.38 | 91.25% | 33.57 | 55.73% | 0.40 | 63.03% | 0.50 | 57.85% |
Gemini-2.5-Flash-Preview-TTS | Zephyr | 10.32 | 69.95% | 0.60 | 88.75% | 12.99 | 58.21% | 18.25 | 88.75% | 31.90 | 61.41% | 0.36 | 63.75% | 0.73 | 57.67% |
Gemini-2.5-Pro-Preview-TTS⭐ | Zephyr | 11.79 | 69.32% | 0.87 | 86.91% | 16.22 | 58.24% | 20.87 | 82.32% | 34.75 | 64.76% | 0.72 | 61.25% | 0.87 | 61.78% |
gpt-4o-audio-preview⭐ | Ballad | 11.87 | 65.17% | 1.82 | 88.84% | 13.30 | 60.17% | 21.15 | 82.14% | 35.32 | 40.40% | 1.38 | 56.96% | 1.16 | 59.53% |
gpt-4o-audio-preview⭐ | Alloy | 12.00 | 57.95% | 0.93 | 61.64% | 13.75 | 62.50% | 20.56 | 68.21% | 36.92 | 49.59% | 1.72 | 47.85% | 1.26 | 56.85% |
gpt-4o-mini-tts⭐ | Alloy | 10.76 | 56.32% | 0.71 | 59.17% | 12.07 | 57.32% | 21.33 | 58.75% | 31.57 | 52.44% | 0.66 | 52.67% | 0.84 | 57.14% |
gpt-4o-audio-preview | Alloy | 12.38 | 53.76% | 1.03 | 48.57% | 14.72 | 60.17% | 23.16 | 66.78% | 35.89 | 40.81% | 1.19 | 47.50% | 1.25 | 57.14% |
gpt-4o-mini-audio-preview⭐ | Alloy | 13.09 | 52.31% | 9.34 | 59.13% | 12.70 | 58.92% | 20.92 | 62.59% | 37.14 | 28.68% | 0.74 | 48.21% | 0.72 | 53.40% |
BASELINE: gpt-4o-mini-tts | Alloy | 10.61 | 50% | 0.72 | – | 13.45 | – | 20.55 | – | 29.90 | – | 0.42 | – | 1.04 | – |
gpt-4o-mini-audio-preview | Alloy | 10.92 | 49.60% | 0.95 | 55.89% | 14.48 | 59.82% | 19.04 | 52.86% | 32.27 | 30.61% | 0.55 | 47.32% | 0.88 | 48.75% |
HumeAI⭐ | – | 12.85 | 42.73% | 0.83 | 61.60% | 21.05 | 34.64% | 19.84 | 36.91% | 37.14 | 34.28% | 0.38 | 43.21% | 0.93 | 44.64% |
minimax/speech-02-hd | English_expressive_narrator | 10.02 | 36.58% | 0.54 | 40.86% | 14.58 | 34.28% | 17.58 | 34.28% | 28.71 | 16.32% | 0.21 | 47.32% | 0.84 | 43.92% |
11Labs eleven multilingual v2 | Brian | 11.19 | 33.89% | 0.63 | 30.35% | 14.44 | 35.53% | 21.51 | 45.53% | 31.44 | 14.48% | 0.49 | 39.46% | 1.15 | 35.53% |
DeepGram Aura-2 | Thalia-en | 16.83 | 32.44% | 3.45 | 29.28% | 21.41 | 18.75% | 23.73 | 21.14% | 54.49 | 33.81% | 1.24 | 48.21% | 1.36 | 43.70% |
Orpheus TTS | Tara | 17.71 | 30.12% | 1.81 | 31.78% | 22.31 | 17.50% | 40.94 | 39.82% | 41.04 | 10.61% | 1.48 | 39.64% | 1.63 | 38.92% |
Qwen 2.5 Omni⭐ | Chelsie | 23.03 | 28.77% | 2.41 | 41.60% | 26.77 | 11.42% | 58.44 | 20.25% | 49.51 | 6.12% | 0.87 | 51.78% | 3.47 | 38.57% |
Qwen 2.5 Omni | Chelsie | 26.58 | 27.07% | 1.22 | 41.18% | 26.98 | 11.07% | 57.48 | 17.44% | 64.07 | 3.30% | 12.77 | 49.28% | 1.66 | 36.96% |
ResembleAI Chatterbox | - | 13.05 | 26.74% | 1.18 | 22.48% | 17.59 | 26.42% | 20.64 | 23.75% | 40.75 | 8.36% | 0.65 | 48.57% | 0.96 | 28.57% |
Kokoro-82M | af_heart | 13.41 | 23.46% | 0.71 | 13.92% | 22.17 | 12.67% | 28.37 | 10.00% | 30.10 | 23.06% | 0.56 | 40.89% | 0.65 | 40.17% |
MiniCPM-o | – | 31.40 | 22.36% | 12.36 | 31.83% | 33.46 | 6.42% | 58.48 | 21.50% | 82.15 | 1.84% | 5.21 | 32.50% | 3.08 | 37.50% |
KyutAI-TTS | ex03-ex01_happy_001_channel1_334s | 12.72 | 20.91% | 0.83 | 29.46% | 16.47 | 16.07% | 26.41 | 26.96% | 33.43 | 6.73% | 0.61 | 30.00% | 1.19 | 14.46% |
Tortoise-TTS | random | 28.62 | 17.67% | 13.04 | 17.92% | 29.61 | 10.00% | 64.93 | 14.28% | 51.87 | 1.59% | 10.44 | 28.28% | 6.35 | 30.82% |
Zyphra/Zonos | exampleaudio | 19.12 | 16.55% | 7.32 | 9.67% | 28.52 | 11.96% | 25.33 | 13.75% | 45.00 | 7.95% | 7.66 | 26.78% | 4.13 | 28.13% |
Sesame1B | – | 32.32 | 15.96% | 17.07 | 7.32% | 45.27 | 10.35% | 49.63 | 18.92% | 80.97 | 7.40% | 2.74 | 31.78% | 4.30 | 18.88% |
F5TTS | basic_ref_en | 16.47 | 15.31% | 0.70 | 26.78% | 23.51 | 1.78% | 31.66 | 21.60% | 42.41 | 1.42% | 1.62 | 14.82% | 2.14 | 23.75% |
Suno Bark | v2/en_speaker_6 | 20.71 | 8.90% | 4.31 | 0.00% | 26.11 | 10.89% | 33.26 | 6.60% | 55.88 | 8.36% | 3.01 | 15.00% | 6.07 | 12.50% |
VITS-VCTK | default | 27.45 | 7.64% | 16.34 | 0.00% | 47.45 | 4.54% | 51.12 | 4.10% | 44.30 | 17.82% | 2.37 | 15.53% | 5.24 | 5.07% |
- The accompanying code provides the pipeline for reproducing the evaluation results reported in the paper and for evaluating your own model on our dataset.
- To evaluate your own model, implement the `generate_audio_out` and `prepare_emergent_tts_sample` methods. If the model is loaded locally on GPU, implement the client in `model_clients.py`; if it is called through an API, put it in `api_clients.py`. In either case, expose your client in `evaluation_runner.py` following the structure of the existing clients. A minimal client sketch is shown after this list.
- We use accelerate for local inference, so if you implement a client in `model_clients.py`, make sure to place the model on `accelerator.device`.
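For orientation, a minimal sketch of such a client follows. Only the method names `generate_audio_out` and `prepare_emergent_tts_sample` are taken from this README; the class name, signatures, dict keys, and return type are assumptions, so mirror the existing clients in `api_clients.py` / `model_clients.py` for the exact interface expected by `evaluation_runner.py`.

```python
# Hypothetical sketch of a custom client exposing the two methods named above.
# Only the method names come from this README; the class name, signatures,
# dict keys, and return type are assumptions -- check the existing clients
# in api_clients.py (or model_clients.py) for the actual interface.
class MyProviderTTSClient:
    def __init__(self, api_key: str, voice: str = "default"):
        self.api_key = api_key
        self.voice = voice

    def prepare_emergent_tts_sample(self, sample: dict) -> dict:
        # Map a benchmark sample (text to synthesize, category, optional
        # strong-prompting instructions) to the payload your API expects.
        # "text_to_synthesize" is a placeholder key, not the dataset's real field name.
        return {"text": sample["text_to_synthesize"], "voice": self.voice}

    def generate_audio_out(self, prepared_sample: dict) -> bytes:
        # Call your provider's TTS endpoint here and return the synthesized
        # audio bytes so the runner can save and score them.
        raise NotImplementedError("call your TTS API and return audio bytes")
```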
First, clone the repo, create a virtual environment, and install the requirements.
```bash
git clone https://github.com/boson-ai/EmergentTTS-Eval
cd EmergentTTS-Eval
python3 -m venv .test_env
source .test_env/bin/activate
pip3 install -r requirements.txt
```
Then, download all of the relevant data using `download_data.py`. This processes the baseline audio files, creates a JSONL structure for the data, and downloads the wv_mos checkpoint for MOS calculation.

```bash
python3 download_data.py
```
Note
If you are evaluating your own model or a new model from another provider, install that model's missing requirements on top of the already installed `requirements.txt`; the current requirements only cover what the benchmark itself needs plus the judger SDKs (google-generativeai and openai).
For models in `api_clients.py`, we call the API with multiple threads for faster inference; the MOS model and WhisperV3 (used for transcription) still need to be placed on GPU for this (a simplified sketch of the threaded pattern follows the command below).
```bash
export <>_API_KEY=<key for the model to evaluate; for OpenAI models it's OPENAI_API_KEY, for Deepgram it's DEEPGRAM_API_KEY, for a custom api_client, follow the variable name you implement in api_clients.py>
export JUDGER_API_KEY=<api key for the judger model, either a Gemini or OpenAI model>
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -u evaluation_runner.py \
    --model_name_or_path "gpt-4o-mini-tts" \
    --output_dir "<path_to_output_dir, absolute path is recommended>" \
    --seed 42 \
    --judge_model_provider "gemini-2.5-pro-preview-05-06" \
    --api_num_threads 14 \
    --temperature 1.0 \
    --text_speech_strong_prompting \
    --tts_judger_evaluate_function "win_rate" \
    --baseline_audios_path <path/to/EmergentTTS-Eval/data/baseline_audios, stored when you run download_data.py>
```
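As an illustration of the threaded API-call pattern mentioned above, here is a simplified sketch. It is not the repo's actual implementation; the `client` object and sample handling are placeholders, and the real loop in `evaluation_runner.py` also handles saving audio, transcription, MOS, and judging.

```python
# Illustrative only: threaded synthesis over benchmark samples via an API client.
from concurrent.futures import ThreadPoolExecutor

def synthesize_all(client, samples, num_threads=14):
    def run_one(sample):
        prepared = client.prepare_emergent_tts_sample(sample)
        return client.generate_audio_out(prepared)

    # Each thread issues an independent API request, so throughput scales
    # roughly with num_threads until provider rate limits (or GPU memory for
    # the local MOS/WhisperV3 models) become the bottleneck.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(run_one, samples))
```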
If the model is in `model_clients.py`, evaluation is run with accelerate. For example, to reproduce the Orpheus-TTS results, install its specific requirements (`snac` is the only missing requirement for Orpheus) and use the command:
```bash
MODEL_NAME="orpheus-tts-0.1-finetune-prod"
export HF_TOKEN=<your_hf_token>
export JUDGER_API_KEY=<api key for the judger model, either a Gemini or OpenAI model>
accelerate launch --config_file default_accelerate_config.yaml evaluation_runner.py \
    --model_name_or_path "canopylabs/orpheus-tts-0.1-finetune-prod" \
    --output_dir "<path_to_output_dir, absolute path is recommended>" \
    --seed 42 \
    --judge_model_provider "gemini-2.5-pro-preview-05-06" \
    --temperature 1.0 \
    --tts_judger_evaluate_function "win_rate" \
    --baseline_audios_path <path/to/EmergentTTS-Eval/data/baseline_audios, stored when you run download_data.py>
```
To use your own judger model, we recommend using vLLM to create an OpenAI-compatible server. Then pass the `--judger_base_url` flag pointing to the IP and port of the custom server.
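Before launching a full run, a quick sanity check that such a server is reachable can look like the snippet below. It assumes the official `openai` Python SDK; the URL, API key, and model name are placeholders, and this snippet is not part of the repo.

```python
# Sanity check for an OpenAI-compatible judger server (e.g., one started with vLLM).
# base_url, api_key, and the model name are placeholders for your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="your-judger-model",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```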
- For a single-GPU setup, edit the `default_accelerate_config.yaml` file and use `gpu_ids: 0,` and `num_processes: 1`.
- Use the `--num_samples` parameter to evaluate on only a subset of samples.
- Carefully tune `--api_num_threads` and `CUDA_VISIBLE_DEVICES` for API-client-based inference in your setup. We place `--api_num_threads` instances of the MOS model and WhisperV3 uniformly across all visible GPUs, so setting `--api_num_threads` too high can cause CUDA OOM.
- Omit `--tts_judger_evaluate_function` if you do not want the LALM judger to judge the output and only want to compute the other metrics such as WER and MOS.
- Pass `--text_speech_strong_prompting` if the client supports strong prompting, as described in the paper.
- To evaluate only on specific depths or categories, use the `--depths_to_evaluate` and `--categories_to_evaluate` flags.
- To use a specific voice of the model, if applicable, use the `--voice_to_use` parameter.
- `--baseline_audios_path` needs to be passed if win-rate is calculated.
- To disable audio generation and fetch previously generated audios from a path, use the `--fetch_audios_from_path` parameter; this is useful for testing a different judger without changing the generated audios.
If our work proves useful to you, cite it as:
```bibtex
@misc{manku2025emergentttseval,
      title={EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge},
      author={Ruskin Raj Manku and Yuzhi Tang and Xingjian Shi and Mu Li and Alex Smola},
      year={2025},
      eprint={2505.23009},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```