UltraEval-Audio: A Unified Framework for Comprehensive Evaluation of Audio Foundation Models

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a unified and systematic evaluation framework for audio foundation models, particularly the notable gaps in codec assessment and Chinese speech benchmarking. To this end, we propose the first modular and unified evaluation framework tailored for audio foundation models, which systematically evaluates codec performance along three dimensions: semantic accuracy, timbre fidelity, and acoustic quality. We also introduce two new Chinese speech understanding benchmarks, SpeechCMMLU and SpeechHSK. The framework integrates 24 mainstream models and 36 authoritative benchmarks, supports 10 languages and 14 core tasks, and provides one-command automated evaluation with real-time leaderboards, significantly enhancing the efficiency, fairness, and transparency of cross-model comparisons.
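
The three codec dimensions map naturally onto standard speech metrics: word error rate for semantic accuracy, speaker-embedding similarity for timbre fidelity, and PESQ for acoustic quality. Below is a minimal, illustrative Python sketch of that mapping, not the framework's actual code; it assumes jiwer, Resemblyzer, pesq, and soundfile are installed, and evaluate_codec with its arguments are hypothetical names.

import numpy as np
import soundfile as sf
from jiwer import wer                                  # word error rate
from resemblyzer import VoiceEncoder, preprocess_wav   # speaker embeddings
from pesq import pesq                                  # perceptual quality

def evaluate_codec(ref_wav, recon_wav, ref_text, recon_transcript, sr=16000):
    """Score one reference/reconstruction pair along the three dimensions.
    Hypothetical helper: recon_transcript is an ASR transcript of the
    reconstructed audio, produced upstream by any ASR system."""
    # 1) Semantic accuracy: how much spoken content survives the codec
    #    round-trip, as WER against the reference text (lower is better).
    semantic_wer = wer(ref_text, recon_transcript)

    # 2) Timbre fidelity: cosine similarity between speaker embeddings of
    #    the original and reconstructed audio (higher is better).
    encoder = VoiceEncoder()
    ref_emb = encoder.embed_utterance(preprocess_wav(ref_wav))
    recon_emb = encoder.embed_utterance(preprocess_wav(recon_wav))
    timbre_sim = float(np.dot(ref_emb, recon_emb) /
                       (np.linalg.norm(ref_emb) * np.linalg.norm(recon_emb)))

    # 3) Acoustic quality: wideband PESQ between the two waveforms
    #    (assumes 16 kHz audio; scores range roughly -0.5 to 4.5).
    ref_audio, _ = sf.read(ref_wav)
    recon_audio, _ = sf.read(recon_wav)
    acoustic = pesq(sr, ref_audio, recon_audio, "wb")

    return {"semantic_wer": semantic_wer,
            "timbre_similarity": timbre_sim,
            "pesq": acoustic}

Whether the framework uses these exact metric implementations is not stated in the abstract; the point is that each dimension reduces to a well-established, automatable score.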

📝 Abstract
The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) audio evaluation lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison; (2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; and (3) existing speech benchmarks rely heavily on English, making it difficult to objectively assess models' performance on Chinese. To address the first issue, we introduce UltraEval-Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval-Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one-command evaluation feature, accompanied by real-time public leaderboards. For the second challenge, UltraEval-Audio adopts a novel comprehensive evaluation scheme for audio codecs, assessing performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We hope that UltraEval-Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparing audio models. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval-Audio.
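
A modular, one-command design typically implies registries that map model and benchmark names to runnable components, so that any registered model can be scored on any registered benchmark. The sketch below illustrates that pattern only; every name in it is hypothetical, not UltraEval-Audio's real API (the repository above documents the actual interface).

from typing import Callable, Dict

MODELS: Dict[str, Callable] = {}
BENCHMARKS: Dict[str, Callable] = {}

def register(registry: Dict[str, Callable], name: str):
    """Decorator that files a component into a registry under a name."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@register(MODELS, "demo-asr")
def demo_asr(audio_path: str) -> str:
    # Stand-in for a real model call (API request or local inference).
    return "hello world"

@register(BENCHMARKS, "demo-bench")
def demo_bench(model: Callable[[str], str]) -> float:
    # A benchmark is just "run the model on its data, return a score".
    return 1.0 if model("sample.wav") == "hello world" else 0.0

# "One-command" evaluation then reduces to two registry lookups:
print(BENCHMARKS["demo-bench"](MODELS["demo-asr"]))  # -> 1.0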
Problem

Research questions and friction points this paper is trying to address.

audio foundation models
evaluation framework
audio codecs
multilingual benchmark
Chinese speech evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Evaluation Framework
Audio Codec Assessment
Multilingual Audio Benchmarking
Chinese Speech Evaluation
Audio Foundation Models
👥 Authors
Qundong Shi (ModelBest Inc.)
Jie Zhou (ModelBest Inc.)
Biyuan Lin (ModelBest Inc.)
Junbo Cui (Tsinghua University)
Guoyang Zeng (ModelBest Inc.)
Yixuan Zhou (ModelBest Inc.)
Ziyang Wang (ModelBest Inc.)
Xin Liu (ModelBest Inc.)
Zhen Luo (ModelBest Inc.)
Yudong Wang (Tsinghua University)
Zhiyuan Liu (Tsinghua University)