Code accompanying the paper, "EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge"
Text-to-Speech (TTS) benchmarks often fail to capture how well models handle nuanced and semantically complex text. We introduce EmergentTTS-Eval, a comprehensive benchmark covering six challenging TTS scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation (e.g. URLs, formulas), and questions. Crucially, our framework automates both test-case generation and evaluation, making the benchmark easily extensible. Starting from a small set of human-written seed prompts, we iteratively extend them using LLMs to target specific structural, phonetic and prosodic challenges, resulting in 1,645 diverse test cases. Moreover, we employ a model-as-a-judge approach, using a Large Audio Language Model (LALM) to assess the speech across multiple dimensions such as expressed emotion, prosodic, intonational, and pronunciation accuracy. We evaluate state-of-the-art open-source and proprietary TTS systems, such as 11Labs, Deepgram, and OpenAI's 4o-mini-TTS, on EmergentTTS-Eval, demonstrating its ability to reveal fine-grained performance differences. Results show that the model-as-a-judge approach offers robust TTS assessment and a high correlation with human preferences.
WER ↓ and Win-rate ↑ over all categories, with gpt-4o-mini-tts-alloy as the baseline and gemini-2.5-pro as the judger; ⭐ indicates that the model was strongly prompted (see the paper for the definition). For some expensive-to-use models, we report only the strongly prompted run.
Note
The generated audios for all of the models we evaluated, along with the predictions made by the gemini-2.5-pro judger, are available in this Google Drive link.
Model | Voice | Overall WER ↓ | Overall Win-Rate ↑ | Emotions WER ↓ | Emotions Win-Rate ↑ | Foreign Words WER ↓ | Foreign Words Win-Rate ↑ | Paralinguistics WER ↓ | Paralinguistics Win-Rate ↑ | Complex Pronunciation WER ↓ | Complex Pronunciation Win-Rate ↑ | Questions WER ↓ | Questions Win-Rate ↑ | Syntactic Complexity WER ↓ | Syntactic Complexity Win-Rate ↑ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gemini-2.5-Flash-Preview-TTS⭐ | Zephyr | 10.39 | 70.70% | 0.71 | 95.89% | 11.80 | 58.45% | 18.38 | 91.25% | 33.57 | 55.73% | 0.40 | 63.03% | 0.50 | 57.85% |
Gemini-2.5-Flash-Preview-TTS | Zephyr | 10.32 | 69.95% | 0.60 | 88.75% | 12.99 | 58.21% | 18.25 | 88.75% | 31.90 | 61.41% | 0.36 | 63.75% | 0.73 | 57.67% |
Gemini-2.5-Pro-Preview-TTS⭐ | Zephyr | 11.79 | 69.32% | 0.87 | 86.91% | 16.22 | 58.24% | 20.87 | 82.32% | 34.75 | 64.76% | 0.72 | 61.25% | 0.87 | 61.78% |
gpt-4o-audio-preview⭐ | Ballad | 11.87 | 65.17% | 1.82 | 88.84% | 13.30 | 60.17% | 21.15 | 82.14% | 35.32 | 40.40% | 1.38 | 56.96% | 1.16 | 59.53% |
gpt-4o-audio-preview⭐ | Alloy | 12.00 | 57.95% | 0.93 | 61.64% | 13.75 | 62.50% | 20.56 | 68.21% | 36.92 | 49.59% | 1.72 | 47.85% | 1.26 | 56.85% |
gpt-4o-mini-tts⭐ | Alloy | 10.76 | 56.32% | 0.71 | 59.17% | 12.07 | 57.32% | 21.33 | 58.75% | 31.57 | 52.44% | 0.66 | 52.67% | 0.84 | 57.14% |
gpt-4o-audio-preview | Alloy | 12.38 | 53.76% | 1.03 | 48.57% | 14.72 | 60.17% | 23.16 | 66.78% | 35.89 | 40.81% | 1.19 | 47.50% | 1.25 | 57.14% |
gpt-4o-mini-audio-preview⭐ | Alloy | 13.09 | 52.31% | 9.34 | 59.13% | 12.70 | 58.92% | 20.92 | 62.59% | 37.14 | 28.68% | 0.74 | 48.21% | 0.72 | 53.40% |
BASELINE: gpt-4o-mini-tts | Alloy | 10.61 | 50% | 0.72 | – | 13.45 | – | 20.55 | – | 29.90 | – | 0.42 | – | 1.04 | – |
gpt-4o-mini-audio-preview | Alloy | 10.92 | 49.60% | 0.95 | 55.89% | 14.48 | 59.82% | 19.04 | 52.86% | 32.27 | 30.61% | 0.55 | 47.32% | 0.88 | 48.75% |
HumeAI⭐ | – | 12.85 | 42.73% | 0.83 | 61.60% | 21.05 | 34.64% | 19.84 | 36.91% | 37.14 | 34.28% | 0.38 | 43.21% | 0.93 | 44.64% |
minimax/speech-02-hd | English_expressive_narrator | 10.02 | 36.58% | 0.54 | 40.86% | 14.58 | 34.28% | 17.58 | 34.28% | 28.71 | 16.32% | 0.21 | 47.32% | 0.84 | 43.92% |
11Labs eleven multilingual v2 | Brian | 11.19 | 33.89% | 0.63 | 30.35% | 14.44 | 35.53% | 21.51 | 45.53% | 31.44 | 14.48% | 0.49 | 39.46% | 1.15 | 35.53% |
DeepGram Aura-2 | Thalia-en | 16.83 | 32.44% | 3.45 | 29.28% | 21.41 | 18.75% | 23.73 | 21.14% | 54.49 | 33.81% | 1.24 | 48.21% | 1.36 | 43.70% |
Orpheus TTS | Tara | 17.71 | 30.12% | 1.81 | 31.78% | 22.31 | 17.50% | 40.94 | 39.82% | 41.04 | 10.61% | 1.48 | 39.64% | 1.63 | 38.92% |
Qwen 2.5 Omni⭐ | Chelsie | 23.03 | 28.77% | 2.41 | 41.60% | 26.77 | 11.42% | 58.44 | 20.25% | 49.51 | 6.12% | 0.87 | 51.78% | 3.47 | 38.57% |
Qwen 2.5 Omni | Chelsie | 26.58 | 27.07% | 1.22 | 41.18% | 26.98 | 11.07% | 57.48 | 17.44% | 64.07 | 3.30% | 12.77 | 49.28% | 1.66 | 36.96% |
ResembleAI Chatterbox | - | 13.05 | 26.74% | 1.18 | 22.48% | 17.59 | 26.42% | 20.64 | 23.75% | 40.75 | 8.36% | 0.65 | 48.57% | 0.96 | 28.57% |
Kokoro-82M | af_heart | 13.41 | 23.46% | 0.71 | 13.92% | 22.17 | 12.67% | 28.37 | 10.00% | 30.10 | 23.06% | 0.56 | 40.89% | 0.65 | 40.17% |
MiniCPM-o | – | 31.40 | 22.36% | 12.36 | 31.83% | 33.46 | 6.42% | 58.48 | 21.50% | 82.15 | 1.84% | 5.21 | 32.50% | 3.08 | 37.50% |
KyutAI-TTS | ex03-ex01_happy_001_channel1_334s | 12.72 | 20.91% | 0.83 | 29.46% | 16.47 | 16.07% | 26.41 | 26.96% | 33.43 | 6.73% | 0.61 | 30.00% | 1.19 | 14.46% |
Tortoise-TTS | random | 28.62 | 17.67% | 13.04 | 17.92% | 29.61 | 10.00% | 64.93 | 14.28% | 51.87 | 1.59% | 10.44 | 28.28% | 6.35 | 30.82% |
Zyphra/Zonos | exampleaudio | 19.12 | 16.55% | 7.32 | 9.67% | 28.52 | 11.96% | 25.33 | 13.75% | 45.00 | 7.95% | 7.66 | 26.78% | 4.13 | 28.13% |
Sesame1B | – | 32.32 | 15.96% | 17.07 | 7.32% | 45.27 | 10.35% | 49.63 | 18.92% | 80.97 | 7.40% | 2.74 | 31.78% | 4.30 | 18.88% |
F5TTS | basic_ref_en | 16.47 | 15.31% | 0.70 | 26.78% | 23.51 | 1.78% | 31.66 | 21.60% | 42.41 | 1.42% | 1.62 | 14.82% | 2.14 | 23.75% |
Suno Bark | v2/en_speaker_6 | 20.71 | 8.90% | 4.31 | 0.00% | 26.11 | 10.89% | 33.26 | 6.60% | 55.88 | 8.36% | 3.01 | 15.00% | 6.07 | 12.50% |
VITS-VCTK | default | 27.45 | 7.64% | 16.34 | 0.00% | 47.45 | 4.54% | 51.12 | 4.10% | 44.30 | 17.82% | 2.37 | 15.53% | 5.24 | 5.07% |
- The accompanying code provides the pipeline for reproducing the evaluation results reported in the paper and for evaluating your own model on our dataset.
- To evaluate your own model, implement the `generate_audio_out` and `prepare_emergent_tts_sample` methods. If the model is loaded locally on GPU, implement the client in `model_clients.py`; if it is called through an API, put it in `api_clients.py`. In either case, expose your client in `evaluation_runner.py` following the structure of the existing clients. A minimal client sketch is shown after this list.
- We use accelerate for local inference, so if you implement a client in `model_clients.py`, make sure to place the model on `accelerator.device`.
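For orientation, a minimal sketch of such a client follows. Only the method names `generate_audio_out` and `prepare_emergent_tts_sample` are taken from this README; the class name, signatures, dict keys, and return type are assumptions, so mirror the existing clients in `api_clients.py` / `model_clients.py` for the exact interface expected by `evaluation_runner.py`.

```python
# Hypothetical sketch of a custom client exposing the two methods named above.
# Only the method names come from this README; the class name, signatures,
# dict keys, and return type are assumptions -- check the existing clients
# in api_clients.py (or model_clients.py) for the actual interface.
class MyProviderTTSClient:
    def __init__(self, api_key: str, voice: str = "default"):
        self.api_key = api_key
        self.voice = voice

    def prepare_emergent_tts_sample(self, sample: dict) -> dict:
        # Map a benchmark sample (text to synthesize, category, optional
        # strong-prompting instructions) to the payload your API expects.
        # "text_to_synthesize" is a placeholder key, not the dataset's real field name.
        return {"text": sample["text_to_synthesize"], "voice": self.voice}

    def generate_audio_out(self, prepared_sample: dict) -> bytes:
        # Call your provider's TTS endpoint here and return the synthesized
        # audio bytes so the runner can save and score them.
        raise NotImplementedError("call your TTS API and return audio bytes")
```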
First, clone the repo, create a virtual environment, and install the requirements.
```bash
git clone https://github.com/boson-ai/EmergentTTS-Eval
cd EmergentTTS-Eval
python3 -m venv .test_env
source .test_env/bin/activate
pip3 install -r requirements.txt
```
Then, download all of the relevant data using `download_data.py`. This processes the baseline audio files, creates a JSONL structure for the data, and downloads the wv_mos checkpoint for MOS calculation.

```bash
python3 download_data.py
```
Note
If you are evaluating your own model or a new model from another provider, install that model's missing requirements on top of the already installed `requirements.txt`; the current requirements only cover what the benchmark itself needs plus the judger SDKs (google-generativeai and openai).
For models in `api_clients.py`, we call the API with multiple threads for faster inference; the MOS model and WhisperV3 (used for transcription) still need to be placed on GPU for this (a simplified sketch of the threaded pattern follows the command below).
```bash
export <>_API_KEY=<key for the model to evaluate; for OpenAI models it's OPENAI_API_KEY, for Deepgram it's DEEPGRAM_API_KEY, for a custom api_client, follow the variable name you implement in api_clients.py>
export JUDGER_API_KEY=<api key for the judger model, either a Gemini or OpenAI model>
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -u evaluation_runner.py \
    --model_name_or_path "gpt-4o-mini-tts" \
    --output_dir "<path_to_output_dir, absolute path is recommended>" \
    --seed 42 \
    --judge_model_provider "gemini-2.5-pro-preview-05-06" \
    --api_num_threads 14 \
    --temperature 1.0 \
    --text_speech_strong_prompting \
    --tts_judger_evaluate_function "win_rate" \
    --baseline_audios_path <path/to/EmergentTTS-Eval/data/baseline_audios, stored when you run download_data.py>
```
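As an illustration of the threaded API-call pattern mentioned above, here is a simplified sketch. It is not the repo's actual implementation; the `client` object and sample handling are placeholders, and the real loop in `evaluation_runner.py` also handles saving audio, transcription, MOS, and judging.

```python
# Illustrative only: threaded synthesis over benchmark samples via an API client.
from concurrent.futures import ThreadPoolExecutor

def synthesize_all(client, samples, num_threads=14):
    def run_one(sample):
        prepared = client.prepare_emergent_tts_sample(sample)
        return client.generate_audio_out(prepared)

    # Each thread issues an independent API request, so throughput scales
    # roughly with num_threads until provider rate limits (or GPU memory for
    # the local MOS/WhisperV3 models) become the bottleneck.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(run_one, samples))
```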
If the model is in `model_clients.py`, evaluation is run with accelerate. For example, to reproduce the Orpheus-TTS results, install its specific requirements (`snac` is the only missing requirement for Orpheus) and use the command:
```bash
MODEL_NAME="orpheus-tts-0.1-finetune-prod"
export HF_TOKEN=<your_hf_token>
export JUDGER_API_KEY=<api key for the judger model, either a Gemini or OpenAI model>
accelerate launch --config_file default_accelerate_config.yaml evaluation_runner.py \
    --model_name_or_path "canopylabs/orpheus-tts-0.1-finetune-prod" \
    --output_dir "<path_to_output_dir, absolute path is recommended>" \
    --seed 42 \
    --judge_model_provider "gemini-2.5-pro-preview-05-06" \
    --temperature 1.0 \
    --tts_judger_evaluate_function "win_rate" \
    --baseline_audios_path <path/to/EmergentTTS-Eval/data/baseline_audios, stored when you run download_data.py>
```
To use your own judger model, we recommend using vLLM to create an OpenAI-compatible server. Then pass the `--judger_base_url` flag pointing to the IP and port of the custom server.
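Before launching a full run, a quick sanity check that such a server is reachable can look like the snippet below. It assumes the official `openai` Python SDK; the URL, API key, and model name are placeholders, and this snippet is not part of the repo.

```python
# Sanity check for an OpenAI-compatible judger server (e.g., one started with vLLM).
# base_url, api_key, and the model name are placeholders for your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="your-judger-model",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```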
- For a single-GPU setup, edit the `default_accelerate_config.yaml` file and use `gpu_ids: 0,` and `num_processes: 1`.
- Use the `--num_samples` parameter to evaluate on only a subset of samples.
- Carefully tune `--api_num_threads` and `CUDA_VISIBLE_DEVICES` for API-client-based inference in your setup. We place `--api_num_threads` instances of the MOS model and WhisperV3 uniformly across all visible GPUs, so setting `--api_num_threads` too high can cause CUDA OOM.
- Omit `--tts_judger_evaluate_function` if you do not want the LALM judger to judge the output and only want to compute the other metrics such as WER and MOS.
- Pass `--text_speech_strong_prompting` if the client supports strong prompting, as described in the paper.
- To evaluate only on specific depths or categories, use the `--depths_to_evaluate` and `--categories_to_evaluate` flags.
- To use a specific voice of the model, if applicable, use the `--voice_to_use` parameter.
- `--baseline_audios_path` needs to be passed if win-rate is calculated.
- To disable audio generation and fetch previously generated audios from a path, use the `--fetch_audios_from_path` parameter; this is useful for testing a different judger without changing the generated audios.
If our work proves useful to you, cite it as:
```bibtex
@misc{manku2025emergentttseval,
      title={EmergentTTS-Eval: Evaluating TTS Models on Complex Prosodic, Expressiveness, and Linguistic Challenges Using Model-as-a-Judge},
      author={Ruskin Raj Manku and Yuzhi Tang and Xingjian Shi and Mu Li and Alex Smola},
      year={2025},
      eprint={2505.23009},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```