A Unified Framework for Comprehensive Evaluation of Audio Foundation Models
中文 | English | 💬Discord | UltraEval-Audio Paper
- Popular model replication: Added replication support for popular models, including replication result showcases and one-click replication commands (see `replication/`).
- Isolated Runtime: Introduced an isolated inference mechanism. Model-specific dependencies are installed/managed automatically; inference runs in the isolated environment and communicates with the main evaluation process via IPC, eliminating dependency conflicts.
- Specialized model evaluation support: Added specialized audio models for TTS, ASR, and Audio Codec, further expanding evaluation coverage.
UltraEval-Audio — The world's first open-source framework supporting both speech understanding and speech generation evaluation, specifically designed for large audio models. It aggregates 34 authoritative benchmarks, covering four major domains: speech, sound, medicine, and music, supporting 10 languages and 12 task categories. With UltraEval-Audio, you will experience unprecedented convenience and efficiency:
- Direct Replication of Popular Models 🔬: Provides detailed replication documentation and commands, ensuring you can easily reproduce evaluation results of open-source models with complete transparency and reproducibility.
- One-Click Benchmark Management 📥: Say goodbye to tedious manual downloading and data processing. UltraEval-Audio automates it all, letting you easily acquire well-known benchmark datasets (e.g., Librispeech, TED-LIUM, Seed-TTS-Eval).
- Built-in Evaluation Tools ⚙️: No need to hunt for evaluation tools. UltraEval-Audio binds datasets with commonly used official evaluation methods (e.g., WER, WER-ZH, BLEU, G-Eval) to ensure alignment between datasets and metrics.
- Powerful and Flexible 🛠️: Supports preview testing, random sampling, error retries, and resume-from-breakpoint, ensuring a flexible and controllable evaluation process while boosting efficiency and accuracy.
- Seamless Integration of Custom Datasets 💼: Supports not only public benchmarks but also powerful custom dataset integration, allowing rapid application in various engineering scenarios.
- Easy Integration with Existing Systems 🔗: With excellent extensibility and standardized design, UltraEval-Audio seamlessly connects with your existing evaluation pipelines, simplifying project management and unifying output results.
- [2025/12/31]
  - Release v1.1 🎉🎉🎉
    - Added replication docs for popular models: CosyVoice2, CosyVoice3, GLM-TTS, IndexTTS2, VoxCPM
    - Support Isolated Runtime offline inference
    - Support TTS, ASR, and Audio Codec task-specific audio models
- [2025/12/04]
  - Support Qwen3-Omni, update Kimi-Audio
- [2025/12/02]
  - 🌟 Added Replication Results and Command Documentation: To better support the open-source community, we have detailed the evaluation process and results of current open-source models, ensuring the evaluation process is completely transparent and reproducible.
  - Support Long-TTS-Eval dataset, see alignment details in Long-TTS-Eval
  - Support MGM-Omni TTS model, see alignment details in MGM-Omni
- [2025/10/30]
  - Support VoxCPM TTS model: `--model voxcpm-tts`, `--model voxcpm-vc`
  - Use `uv` to accelerate model dependency installation 🚀
- [2025/10/17]
- [2025/05/22]
- [2025/05/12]
  - Support Qwen2.5-Omni (`qwen2.5-omni-audio`, `qwen2.5-omni-speech`) and Kimi-Audio-7B-Instruct (`kimiaudio`, `kimiaudio-speech`) models, and update the Audio Understanding Leaderboard
- [2025/05/08]
  - Faster resume evaluation: the `-r/--resume` parameter automatically searches for the latest breakpoint result if no file is specified
  - Support evaluation starting from an inference file: the `--infer-file` parameter allows direct evaluation from an inference file without regeneration
- [2025/03/23]
  - Added support for step-audio model evaluation and ranking
    - Ranking details: leaderboard.md
    - Evaluation support: Step-Audio-Chat
- [2025/03/04]
  - Support [resume evaluation](docs/Procedures for Restarting an Incomplete Evaluation.md) with the command-line parameter `--resume $checkpoint_res_file`
  - GLM-4-Voice service deployment supports UltraEval-Audio evaluation, see details at GLM-4-Voice
  - Parallel evaluation support with the command-line parameter `--workers $num_workers`
- [2025/01/13] release v1.0
Audio Understanding: Audio Foundation Models (Speech + Text → Text)

WER/CER ($\downarrow$) for ASR, BLEU ($\uparrow$) for AST, and ACC ($\uparrow$) for EMO. Best results are in bold.

Scoring:
- Avg. Score ($\uparrow$): mean of all available normalized metric scores. For WER/CER-based metrics we use $(100-\text{WER/CER})$; for other metrics (e.g., BLEU/Acc.) we keep the original value.
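Written out explicitly (notation introduced here for clarity: $M$ is the set of available metric values for a given model and $\text{val}_m$ is the raw value of metric $m$), the scoring rule above amounts to:

$$
\text{Avg. Score} = \frac{1}{|M|}\sum_{m \in M} s_m, \qquad
s_m =
\begin{cases}
100 - \text{val}_m, & m \text{ is WER/CER-based},\\
\text{val}_m, & \text{otherwise (e.g., BLEU, Acc.)}.
\end{cases}
$$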
| Model | ASR Librispeech dev-clean\|dev-other test-clean\|test-other | ASR TED-LIUM | ASR CV-15 en\|zh | ASR Aishell-1 | ASR FLEURS | ASR Wenet test-net | AST covost2-en2zh | AST covost2-zh2en | EMO MELD | Avg. Score ($\uparrow$) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-Realtime | 2.30\|5.60 2.60\|5.50 | 4.80 | 27.44\|37.44 | 7.30 | 5.40 | 28.90 | 37.10 | 15.70 | 33.20 | 73.75 |
| Qwen3-Omni-30B-A3B-Instruct | 1.25\|2.27 1.36\|2.57 | 2.82 | 6.00\|4.32 | 0.87 | 2.61 | 4.82 | 46.58 | 29.40 | 56.81 | 84.92 |
| Qwen2.5-Omni | 2.10\|4.20 2.40\|4.20 | 4.70 | 8.70\|5.20 | 1.10 | 4.60 | 6.00 | 42.50 | 11.50 | 53.60 | 81.88 |
| MiniCPM-o 2.6 | 1.60\|3.40 1.70\|4.40 | 3.00 | 10.30\|9.60 | 1.60 | 4.40 | 6.90 | 48.20 | 27.20 | 52.40 | 83.15 |
| Kimi-Audio-7B-Instruct | 1.18\|2.34 1.28\|2.44 | 2.96 | 7.09\|5.72 | 0.60 | 2.53 | 5.55 | 36.61 | 18.30 | 59.23 | 83.27 |
| Gemini-1.5-Flash | 5.90\|7.20 21.90\|16.30 | 6.90 | 208.00\|84.37 | 9.00 | 85.90 | 279.90 | 33.40 | 8.20 | 45.20 | 27.80 |
| Gemini-1.5-Pro | 2.60\|4.40 2.90\|4.90 | 3.00 | 8.36\|13.26 | 4.50 | 5.90 | 14.30 | 47.30 | 22.60 | 48.40 | 81.09 |
| Gemini-2.5-Flash | 3.73\|6.71 3.28\|12.03 | 3.53 | 46.76\|36.15 | 6.40 | 6.45 | 126.07 | 3.67 | 10.61 | 51.53 | 62.67 |
| Gemini-2.5-Pro | 5.30\|4.51 2.84\|6.74 | 2.52 | 9.42\|11.04 | 3.36 | 4.25 | 16.83 | 41.75 | 27.84 | 46.59 | 80.72 |
| Qwen2-Audio-7B | 1.57\|3.50 1.60\|3.88 | 3.43 | 8.67\|7.03 | 1.52 | 5.89 | 8.09 | 45.30 | 24.84 | 42.87 | 82.14 |
| Qwen2-Audio-7B-Instruct | 2.90\|5.50 3.10\|5.70 | 5.90 | 10.68\|8.39 | 2.60 | 6.90 | 10.30 | 39.50 | 22.90 | 17.40 | 78.29 |
| MiDaShengLM-7B | 2.20\|4.75 2.21\|5.16 | 146.53 | 13.66\|29.13 | 1.23 | 3.28 | 16.56 | 38.52 | 22.68 | 53.96 | 68.50 |
Audio Generation: Audio Foundation Models (Speech → Speech)

Table: Audio generation performance ($\uparrow$). Acoustic metrics (UTMOS | DNSMOS P.835 | DNSMOS P.808, scores range from 0 to 5) are evaluated on the generated audio responses from the speech tasks. Best results are in bold.

Note: The average score is computed as the average of 6 scores: five speech-task scores and one normalized acoustic score. For acoustic scores (UTMOS | DNSMOS P.835 | DNSMOS P.808), each value (0–5) is multiplied by 20 to map to 0–100, then averaged to obtain the normalized acoustic score.
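Equivalently, writing $s_1,\dots,s_5$ for the five speech-task scores (notation introduced here for clarity), the note above amounts to:

$$
\text{Acoustic} = \frac{20\,\text{UTMOS} + 20\,\text{DNSMOS}_{\text{P.835}} + 20\,\text{DNSMOS}_{\text{P.808}}}{3}, \qquad
\text{Avg. Score} = \frac{s_1 + s_2 + s_3 + s_4 + s_5 + \text{Acoustic}}{6}
$$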
| Models | Speech WebQuestions | Speech TriviaQA | Speech AlpacaEval | Speech CMMLU | Speech HSK | Acoustics | Avg. Score ($\uparrow$) |
|---|---|---|---|---|---|---|---|
| GPT-4o-Realtime | 51.60 | 69.70 | 74.00 | 70.05 | 98.69 | 4.29\|3.44\|4.26 | 74.00 |
| Qwen3-Omni-30B-A3B-Instruct | 51.50 | 55.27 | 67.97 | 47.83 | 40.27 | 4.44\|3.45\|4.12 | 57.15 |
| Qwen2.5-Omni | 38.89 | 39.94 | 54.00 | 73.72 | 95.65 | 4.23\|3.48\|4.27 | 63.68 |
| MiniCPM-o 2.6 | 40.00 | 40.20 | 51.00 | 51.37 | 80.68 | 4.12\|3.39\|4.02 | 56.69 |
| Kimi-Audio-7B-Instruct | 33.69 | 38.20 | 34.40 | 71.25 | 97.42 | 2.94\|3.22\|3.62 | 56.69 |
| GLM-4-Voice | 32.00 | 36.40 | 51.00 | 52.61 | 71.06 | 4.21\|3.46\|4.07 | 53.56 |
Audio Codec: Speech → Speech

Table: Audio Codec performance: ASR-WER ($\downarrow$), ASR-CER ($\downarrow$), SIM ($\uparrow$), and Acoustics (UTMOS | DNSMOS P.835 | DNSMOS P.808, $\uparrow$). The hyphen (-) indicates that UTMOS is not applicable to Chinese speech (AISHELL-1). Best results are in bold.

Note: For acoustic scores we use the UTMOS, DNSMOS P.835, and DNSMOS P.808 metrics. To compute the average score, ASR-WER and ASR-CER are converted to $(100-\text{val})$; each available acoustic value (ranging from 0 to 5) is mapped to 0–100 as $20\times\text{val}$, and the acoustic score is the average of the available values (hyphens are ignored). The final score is the average of the 9 metric scores.
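Restating the note above as a formula ($d$ ranges over the three test sets and $A_d$ is the set of available acoustic values for $d$; UTMOS is dropped for AISHELL-1, and SIM values, already on a 0–100 scale, appear to enter the average unchanged):

$$
\text{Avg. Score} = \frac{1}{9}\sum_{d}\left[(100 - \text{ASR}_d) + \text{SIM}_d + \frac{1}{|A_d|}\sum_{a \in A_d} 20\,a\right]
$$

For example, applying this to the Encodec-24k row gives approximately 65.2, consistent with the table.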
| Models | Librispeech-dev-clean ASR-WER | Librispeech-dev-clean SIM | Librispeech-dev-clean Acoustics | Librispeech-test-clean ASR-WER | Librispeech-test-clean SIM | Librispeech-test-clean Acoustics | AISHELL-1 ASR-CER | AISHELL-1 SIM | AISHELL-1 Acoustics | Avg. Score ($\uparrow$) |
|---|---|---|---|---|---|---|---|---|---|---|
| Encodec-24k | 4.56 | 59.40 | 1.58\|3.12\|2.36 | 4.32 | 59.40 | 1.57\|3.12\|2.36 | 13.95 | 47.48 | -\|2.93\|2.03 | 65.24 |
| Encodec-48k | 3.85 | 65.53 | 1.52\|2.88\|2.42 | 3.80 | 66.00 | 1.48\|2.87\|2.40 | 6.85 | 68.78 | -\|2.79\|2.21 | 69.59 |
| ChatTTS-DVAE | 7.49 | 34.83 | 1.30\|2.66\|2.11 | 6.75 | 36.21 | 1.29\|2.64\|2.12 | 32.36 | 32.36 | -\|2.24\|1.57 | 52.86 |
| Mimi (32bit) | 2.04 | 92.18 | 3.83\|2.87\|2.44 | 1.96 | 92.68 | 3.84\|2.92\|2.49 | 2.82 | 84.80 | -\|2.43\|1.89 | 80.96 |
| Mimi (8bit) | 2.76 | 72.15 | 3.52\|2.78\|2.37 | 2.83 | 73.13 | 3.53\|2.83\|2.43 | 6.82 | 60.63 | -\|2.42\|2.04 | 72.72 |
| Mimi-streaming (8bit) | 6.76 | 54.02 | 1.65\|2.78\|2.37 | 6.19 | 54.32 | 1.63\|2.83\|2.43 | 19.62 | 40.67 | -\|2.42\|2.04 | 61.37 |
| WavTokenizer-large-75 | 4.31 | 69.97 | 4.01\|3.64\|3.26 | 4.05 | 68.15 | 4.00\|3.63\|3.27 | 8.97 | 64.27 | -\|3.11\|2.85 | 76.67 |
| WavTokenizer-large-40 | 8.13 | 60.26 | 3.78\|3.70\|3.13 | 7.73 | 56.63 | 3.77\|3.70\|3.16 | 25.52 | 49.21 | -\|3.13\|2.50 | 69.18 |
| Spark | 2.39 | 79.94 | 4.18\|3.85\|3.24 | 2.53 | 79.53 | 4.18\|3.83\|3.24 | 3.66 | 74.76 | -\|3.63\|2.85 | 82.29 |
git clone https://github.com/OpenBMB/UltraEval-Audio.git
cd UltraEval-Audio
conda create -n env python=3.10 -y
conda activate env
pip install -e .
or use uv for faster installation:
uv venv env --python 3.10
source env/bin/activate
uv pip install -e .
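After installation, a quick way to sanity-check the setup is to list the supported datasets and models (the same command referenced in the evaluation section below):

```bash
# List supported datasets and models
python cli/list_availabel.py
```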
# For some regions, you may need to set: export HF_ENDPOINT=https://hf-mirror.com
# Test MiniCPM-o 2.6 speech understanding capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --prompt mini-cpm-omni-asr-zh --model MiniCPMo2_6-audio
# Test MiniCPM-o 2.6 speech generation capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset llama-questions-s2t --model MiniCPMo2_6-speech
# Test GPT-4o-Realtime speech understanding capability
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gpt4o_audio
# Test GPT-4o-Realtime speech generation capability
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset llama-questions-s2t --model gpt4o_speech
# Test gemini-1.5-pro speech understanding capability
export GOOGLE_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gemini-pro
# Test qwen2-audio-offline speech understanding capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat
If you encounter errors or cannot reproduce MiniCPM-o 2.6 results, please check the FAQ.
When the evaluation completes, the results are organized as follows:

- `res`
  - `$model-name`
    - `$dataset`
      - `$time.jsonl`
      - `$time-overview.jsonl`
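The `$time-overview.jsonl` file is the run-level summary; one simple way to inspect it (the path below is illustrative, reusing the quick-start model and dataset names):

```bash
# Show the summary of a finished run (illustrative path)
cat res/MiniCPMo2_6-audio/sample/*-overview.jsonl
```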
Evaluation command:
python audio_evals/main.py --dataset <dataset_name> --model <model_name>
- `<dataset_name>` specifies the dataset to evaluate. Supported datasets can be viewed via `python cli/list_availabel.py`.
  - Construct your own dataset: docs/how add a dataset.md
- `<model_name>` specifies the model to evaluate. Supported models can be viewed via `python cli/list_availabel.py`.
  - Evaluate your own model: docs/how eval your model.md
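Putting the documented options together, a fuller invocation might look like this (an illustrative sketch: it reuses the quick-start dataset/model names and only parameters described elsewhere in this README):

```bash
# Parallel evaluation with 4 workers (--workers parameter)
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat --workers 4

# If the run above is interrupted, resume from the latest breakpoint (-r/--resume with no file)
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat --workers 4 --resume
```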
If you have any suggestions or questions, please file an issue or join our Discord group: https://discord.com/invite/Qrsbft4e
If you find UltraEval-Audio helpful, please consider citing our paper 📝 and starring us ⭐️!
@article{ultraevalaudio,
title={UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models},
author={Qundong Shi and Jie Zhou and Biyuan Lin and Junbo Cui and Guoyang Zeng and Yixuan Zhou and Ziyang Wang and Xin Liu and Zhen Luo and Yudong Wang and Zhiyuan Liu},
year={2026},
eprint={2601.01373},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2601.01373},
}

