
Your faithful, impartial partner for audio evaluation — honest evaluation; know yourself, know your rivals.

[UltraEval-Audio logo: assets/logo.png]

A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

Chinese | English | 💬Discord | UltraEval-Audio Paper

v1.1 Highlights

  • Popular model replication: Added support for replicating popular models, including result showcases and one-click replication commands (see replication/).
  • Isolated Runtime: Introduced an isolated inference mechanism. Model-specific dependencies are installed and managed automatically, and inference runs in an isolated environment that communicates with the main evaluation process via IPC, eliminating dependency conflicts.
  • Specialized model evaluation: Added support for specialized audio models covering TTS, ASR, and audio codecs, further expanding evaluation coverage.

Overview

🚀Exceptional Experience with UltraEval-Audio🚀

UltraEval-Audio is the world's first open-source framework to support both speech understanding and speech generation evaluation, designed specifically for large audio models. It aggregates 34 authoritative benchmarks covering four major domains (speech, sound, medicine, and music), 10 languages, and 12 task categories. With UltraEval-Audio, you get unprecedented convenience and efficiency:

  • Direct Replication of Popular Models 🔬: Provides detailed replication documentation and commands, so you can easily reproduce the evaluation results of open-source models with full transparency.
  • One-Click Benchmark Management 📥: Say goodbye to tedious manual downloading and data processing. UltraEval-Audio automates it all, letting you easily acquire well-known benchmark datasets (e.g., Librispeech, TED-LIUM, Seed-TTS-Eval).
  • Built-in Evaluation Tools ⚙️: No need to hunt for evaluation tools. UltraEval-Audio binds datasets with commonly used official evaluation methods (e.g., WER, WER-ZH, BLEU, G-Eval) to ensure alignment between datasets and metrics.
  • Powerful and Flexible 🛠️: Supports preview testing, random sampling, error retries, and resume-from-checkpoint, ensuring a flexible and controllable evaluation process while boosting efficiency and accuracy.
  • Seamless Integration of Custom Datasets 💼: Supports not only public benchmarks but also powerful custom dataset integration, allowing rapid application in various engineering scenarios.
  • Easy Integration with Existing Systems 🔗: With excellent extensibility and standardized design, UltraEval-Audio seamlessly connects with your existing evaluation pipelines, simplifying project management and unifying output results.

[UltraEval-Audio architecture diagram]

Changelog🔥

  • [2025/12/31]
    • release v1.1 🎉🎉🎉
  • [2025/12/04]
  • [2025/12/02]
  • [2025/10/30]
    • Support VoxCPM models: --model voxcpm-tts, --model voxcpm-vc
    • Use uv to accelerate model dependency installation 🚀
  • [2025/10/17]
  • [2025/05/22]
  • [2025/05/12]
    • Support Qwen2.5-Omni (qwen2.5-omni-audio, qwen2.5-omni-speech) and Kimi-Audio-7B-Instruct (kimiaudio, kimiaudio-speech) models, and update the Audio Understanding Leaderboard
  • [2025/05/08]
    • Faster resume evaluation: the -r/--resume parameter automatically searches for the most recent checkpoint result if no file is specified
    • Support evaluation starting from an inference file: the --infer-file parameter allows evaluating directly from an existing inference file without regenerating model outputs
  • [2025/03/23]
  • [2025/03/04]
    • Support resume evaluation (see docs/Procedures for Restarting an Incomplete Evaluation.md), command-line parameter --resume $checkpoint_res_file
    • glm-4-voice service deployment now supports UltraEval-Audio evaluation; see GLM-4-Voice for details
    • Support parallel evaluation, command-line parameter --workers $num_workers
  • [2025/01/13] release v1.0

Leaderboard

Audio Understanding Leaderboard

Audio Understanding (Audio Foundation Models): Speech + Text → Text

Metrics: WER/CER ($\downarrow$) for ASR, BLEU ($\uparrow$) for AST, and ACC ($\uparrow$) for EMO.

Scoring:

  • Avg. Score ($\uparrow$): mean of all available normalized metric scores. For WER/CER-based metrics we use $100-\text{WER/CER}$; for other metrics (e.g., BLEU/ACC) we keep the original value.
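As a worked check of this rule, plugging the GPT-4o-Realtime row from the table below into it (ten $100-\text{WER/CER}$ terms, two BLEU scores, and one accuracy) reproduces its reported average:

$$\frac{(97.70+94.40+97.40+94.50+95.20+72.56+62.56+92.70+94.60+71.10)+37.10+15.70+33.20}{13}\approx 73.75$$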
| Model | ASR Librispeech (dev-clean/dev-other, test-clean/test-other) | ASR TED-LIUM | ASR CV-15 (en/zh) | ASR Aishell-1 | ASR FLEURS | ASR Wenet-test-net | AST covost2-en2zh | AST covost2-zh2en | EMO MELD | Avg. Score ($\uparrow$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-Realtime | 2.30/5.60, 2.60/5.50 | 4.80 | 27.44/37.44 | 7.30 | 5.40 | 28.90 | 37.10 | 15.70 | 33.20 | 73.75 |
| Qwen3-Omni-30B-A3B-Instruct | 1.25/2.27, 1.36/2.57 | 2.82 | 6.00/4.32 | 0.87 | 2.61 | 4.82 | 46.58 | 29.40 | 56.81 | 84.92 |
| Qwen2.5-Omni | 2.10/4.20, 2.40/4.20 | 4.70 | 8.70/5.20 | 1.10 | 4.60 | 6.00 | 42.50 | 11.50 | 53.60 | 81.88 |
| MiniCPM-o 2.6 | 1.60/3.40, 1.70/4.40 | 3.00 | 10.30/9.60 | 1.60 | 4.40 | 6.90 | 48.20 | 27.20 | 52.40 | 83.15 |
| Kimi-Audio-7B-Instruct | 1.18/2.34, 1.28/2.44 | 2.96 | 7.09/5.72 | 0.60 | 2.53 | 5.55 | 36.61 | 18.30 | 59.23 | 83.27 |
| Gemini-1.5-Flash | 5.90/7.20, 21.90/16.30 | 6.90 | 208.00/84.37 | 9.00 | 85.90 | 279.90 | 33.40 | 8.20 | 45.20 | 27.80 |
| Gemini-1.5-Pro | 2.60/4.40, 2.90/4.90 | 3.00 | 8.36/13.26 | 4.50 | 5.90 | 14.30 | 47.30 | 22.60 | 48.40 | 81.09 |
| Gemini-2.5-Flash | 3.73/6.71, 3.28/12.03 | 3.53 | 46.76/36.15 | 6.40 | 6.45 | 126.07 | 3.67 | 10.61 | 51.53 | 62.67 |
| Gemini-2.5-Pro | 5.30/4.51, 2.84/6.74 | 2.52 | 9.42/11.04 | 3.36 | 4.25 | 16.83 | 41.75 | 27.84 | 46.59 | 80.72 |
| Qwen2-Audio-7B | 1.57/3.50, 1.60/3.88 | 3.43 | 8.67/7.03 | 1.52 | 5.89 | 8.09 | 45.30 | 24.84 | 42.87 | 82.14 |
| Qwen2-Audio-7B-Instruct | 2.90/5.50, 3.10/5.70 | 5.90 | 10.68/8.39 | 2.60 | 6.90 | 10.30 | 39.50 | 22.90 | 17.40 | 78.29 |
| MiDaShengLM-7B | 2.20/4.75, 2.21/5.16 | 146.53 | 13.66/29.13 | 1.23 | 3.28 | 16.56 | 38.52 | 22.68 | 53.96 | 68.50 |

Audio Generation Leaderboard

Audio Generation (Audio Foundation Models): Speech → Speech. Table: audio generation performance ($\uparrow$). Acoustic metrics (UTMOS / DNSMOS P.835 / DNSMOS P.808, scored from 0 to 5) are evaluated on the audio responses generated in the speech tasks.

Note: The average score is the mean of six scores: the five speech-task scores and a normalized acoustic score. For the acoustic metrics (UTMOS / DNSMOS P.835 / DNSMOS P.808), each value (0-5) is multiplied by 20 to map it to 0-100, and the three values are then averaged to obtain the normalized acoustic score.
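As a worked check of this rule, the GPT-4o-Realtime row in the table below has a normalized acoustic score of $20\times\tfrac{4.29+3.44+4.26}{3}\approx 79.93$, and

$$\frac{51.60+69.70+74.00+70.05+98.69+79.93}{6}\approx 74.00,$$

which matches its reported average.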

| Model | Speech WebQuestions | Speech TriviaQA | Speech AlpacaEval | Speech CMMLU | Speech HSK | Acoustics (UTMOS/DNSMOS P.835/DNSMOS P.808) | Avg. Score ($\uparrow$) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-Realtime | 51.60 | 69.70 | 74.00 | 70.05 | 98.69 | 4.29/3.44/4.26 | 74.00 |
| Qwen3-Omni-30B-A3B-Instruct | 51.50 | 55.27 | 67.97 | 47.83 | 40.27 | 4.44/3.45/4.12 | 57.15 |
| Qwen2.5-Omni | 38.89 | 39.94 | 54.00 | 73.72 | 95.65 | 4.23/3.48/4.27 | 63.68 |
| MiniCPM-o 2.6 | 40.00 | 40.20 | 51.00 | 51.37 | 80.68 | 4.12/3.39/4.02 | 56.69 |
| Kimi-Audio-7B-Instruct | 33.69 | 38.20 | 34.40 | 71.25 | 97.42 | 2.94/3.22/3.62 | 56.69 |
| GLM-4-Voice | 32.00 | 36.40 | 51.00 | 52.61 | 71.06 | 4.21/3.46/4.07 | 53.56 |

Audio Codec Leaderboard

Audio Codec: Speech → Speech. Table: audio codec performance: ASR-WER ($\downarrow$), ASR-CER ($\downarrow$), SIM ($\uparrow$), and Acoustics (UTMOS / DNSMOS P.835 / DNSMOS P.808, $\uparrow$). A hyphen (-) indicates that UTMOS is not applicable to Chinese speech (AISHELL-1).

Note: For the acoustic scores we use the UTMOS, DNSMOS P.835, and DNSMOS P.808 metrics. To compute the average score, ASR-WER and ASR-CER values are converted to $100-\text{val}$, and each available acoustic value (ranging from 0 to 5) is normalized as $20\times\text{val}$ (mapping it to 0-100); the acoustic score for a dataset is the average of its available values (hyphens are ignored). The final score is the average of the 9 resulting metric scores.
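As a worked check of this rule, the Encodec-24k row in the table below yields nine scores: $100-4.56=95.44$, $59.40$, and $20\times\tfrac{1.58+3.12+2.36}{3}\approx 47.07$ for dev-clean; $100-4.32=95.68$, $59.40$, and $20\times\tfrac{1.57+3.12+2.36}{3}=47.00$ for test-clean; and $100-13.95=86.05$, $47.48$, and $20\times\tfrac{2.93+2.03}{2}=49.60$ for AISHELL-1 (UTMOS skipped). Their mean is approximately 65.24, matching its reported average.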

| Model | Librispeech dev-clean ASR-WER | Librispeech dev-clean SIM | Librispeech dev-clean Acoustics | Librispeech test-clean ASR-WER | Librispeech test-clean SIM | Librispeech test-clean Acoustics | AISHELL-1 ASR-CER | AISHELL-1 SIM | AISHELL-1 Acoustics | Avg. Score ($\uparrow$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Encodec-24k | 4.56 | 59.40 | 1.58/3.12/2.36 | 4.32 | 59.40 | 1.57/3.12/2.36 | 13.95 | 47.48 | -/2.93/2.03 | 65.24 |
| Encodec-48k | 3.85 | 65.53 | 1.52/2.88/2.42 | 3.80 | 66.00 | 1.48/2.87/2.40 | 6.85 | 68.78 | -/2.79/2.21 | 69.59 |
| ChatTTS-DVAE | 7.49 | 34.83 | 1.30/2.66/2.11 | 6.75 | 36.21 | 1.29/2.64/2.12 | 32.36 | 32.36 | -/2.24/1.57 | 52.86 |
| Mimi (32bit) | 2.04 | 92.18 | 3.83/2.87/2.44 | 1.96 | 92.68 | 3.84/2.92/2.49 | 2.82 | 84.80 | -/2.43/1.89 | 80.96 |
| Mimi (8bit) | 2.76 | 72.15 | 3.52/2.78/2.37 | 2.83 | 73.13 | 3.53/2.83/2.43 | 6.82 | 60.63 | -/2.42/2.04 | 72.72 |
| Mimi-streaming (8bit) | 6.76 | 54.02 | 1.65/2.78/2.37 | 6.19 | 54.32 | 1.63/2.83/2.43 | 19.62 | 40.67 | -/2.42/2.04 | 61.37 |
| WavTokenizer-large-75 | 4.31 | 69.97 | 4.01/3.64/3.26 | 4.05 | 68.15 | 4.00/3.63/3.27 | 8.97 | 64.27 | -/3.11/2.85 | 76.67 |
| WavTokenizer-large-40 | 8.13 | 60.26 | 3.78/3.70/3.13 | 7.73 | 56.63 | 3.77/3.70/3.16 | 25.52 | 49.21 | -/3.13/2.50 | 69.18 |
| Spark | 2.39 | 79.94 | 4.18/3.85/3.24 | 2.53 | 79.53 | 4.18/3.83/3.24 | 3.66 | 74.76 | -/3.63/2.85 | 82.29 |

Quick Start

Environment Preparation

git clone https://github.com/OpenBMB/UltraEval-Audio.git
cd UltraEval-Audio
conda create -n env python=3.10 -y
conda activate env
pip install -e .

or use uv for faster installation:

uv venv env --python 3.10
source env/bin/activate
uv pip install -e .

Run Examples

# For some regions, you may need to set: export HF_ENDPOINT=https://hf-mirror.com
# Test MiniCPM-o 2.6 speech understanding capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --prompt mini-cpm-omni-asr-zh --model MiniCPMo2_6-audio

# Test MiniCPM-o 2.6 speech generation capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset llama-questions-s2t --model MiniCPMo2_6-speech

# Test GPT-4o-Realtime speech understanding capability
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gpt4o_audio

# Test GPT-4o-Realtime speech generation capability
export OPENAI_API_KEY=$your-key
python audio_evals/main.py --dataset llama-questions-s2t --model gpt4o_speech

# Test gemini-1.5-pro speech understanding capability
export GOOGLE_API_KEY=$your-key
python audio_evals/main.py --dataset sample --model gemini-pro


# Test qwen2-audio-offline speech understanding capability
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat

If you encounter errors or cannot reproduce MiniCPM-o 2.6 results, please check the FAQ.

Results

After the evaluation completes, results are saved under res/ as follows:

- res
    |-- $model-name
        |-- $dataset
            |-- $time.jsonl
            |-- $time-overview.jsonl
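
A minimal way to inspect a finished run, assuming the layout above (the model and dataset names below are placeholders reused from the earlier examples, and it is assumed that the overview file holds the aggregated metric scores while the timestamped file holds per-sample records):

# Substitute your actual model name, dataset name, and timestamp
ls res/MiniCPMo2_6-audio/sample/
cat res/MiniCPMo2_6-audio/sample/$time-overview.jsonl    # aggregated scores (assumed)
head res/MiniCPMo2_6-audio/sample/$time.jsonl            # per-sample records (assumed)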

Usage

Evaluation command:

python audio_evals/main.py --dataset <dataset_name> --model <model_name>
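
The optional flags documented in the changelog (--workers, -r/--resume, --infer-file) can also be passed here. A sketch, with the dataset, model, and inference-file path as placeholders:

# Parallel evaluation with 4 workers
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat --workers 4

# Resume from the most recent checkpoint (no file needs to be specified)
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat -r

# Evaluate directly from an existing inference file without regenerating model outputs
CUDA_VISIBLE_DEVICES=0 python audio_evals/main.py --dataset sample --model qwen2-audio-chat --infer-file res/qwen2-audio-chat/sample/$time.jsonl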

Dataset Selection

<dataset_name> specifies the dataset to evaluate. Supported datasets can be listed with python cli/list_availabel.py.

To construct your own dataset, see docs/how add a dataset.md.

Model Selection

<model_name> specifies the model to evaluate. Supported models can be listed with python cli/list_availabel.py.

To evaluate your own model, see docs/how eval your model.md.

Contact Us

If you have any suggestions or questions, please file an issue or join our Discord group: https://discord.com/invite/Qrsbft4e

Citation

If you find UltraEval-Audio helpful, please consider citing our paper 📝 and starring the repo ⭐️!

@article{ultraevalaudio,
      title={UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models},
      author={Qundong Shi and Jie Zhou and Biyuan Lin and Junbo Cui and Guoyang Zeng and Yixuan Zhou and Ziyang Wang and Xin Liu and Zhen Luo and Yudong Wang and Zhiyuan Liu},
      year={2026},
      eprint={2601.01373},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.01373},
}