WildSpeech-Bench: Benchmarking Audio LLMs in Natural Speech Conversation

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio large language models (Audio-LLMs) are predominantly evaluated with text-based benchmarks that overlook speech-specific challenges (prosody, homophones, and disfluencies) and lack end-to-end, scenario-aware, high-fidelity evaluation protocols. Method: We introduce WildSpeech-Bench, the first end-to-end benchmark for realistic spoken dialogue, encompassing diverse speaker attributes, acoustic environments, and spontaneous speech phenomena. We propose a query-aware automated evaluation framework that combines scenario-specific checklists with customized prompts to improve assessment accuracy and enable fine-grained, multi-scenario performance attribution. Contribution/Results: A comprehensive evaluation of mainstream Audio-LLMs on WildSpeech-Bench reveals pronounced scenario dependency in speech understanding tasks. This work establishes a reproducible, standardized, high-fidelity evaluation paradigm for audio LLM development and assessment.

📝 Abstract
Recent multi-modal Large Language Models (LLMs) such as GPT-4o have demonstrated strong capabilities in direct speech interaction. However, the lack of specialized and comprehensive benchmarks for end-to-end speech LLM evaluation hinders optimizing the user experience of Audio LLMs in real-world applications. Existing evaluation methods often adapt text-based benchmarks, overlooking speech's unique characteristics and challenges, including prosody, homophones, stuttering, and differing user expectations. Here, we present a novel approach to thoroughly evaluate LLMs in practical speech conversations. We systematically curate real-world chat data relevant to spoken scenarios, introduce diversity in speaker attributes and acoustic conditions, and augment the dataset with speech-specific phenomena. We further design a query-aware evaluation method that uses customized evaluation checklists and prompts to enhance the accuracy of automatic evaluation. We conduct comprehensive testing and detailed analysis of various mainstream speech models, revealing significant differences in model performance across different speech scenarios. The query-aware evaluation further enables a finer-grained assessment under various speech-specific scenarios. Our benchmark can provide valuable insights for speech model development and evaluation.
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized benchmarks for speech LLM evaluation
Overlooking speech's unique characteristics and challenges
Need for accurate evaluation in diverse speech scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically curate real-world speech chat data
Augment dataset with speech-specific phenomena
Design query-aware evaluation method
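The query-aware evaluation idea above can be sketched as follows: route each spoken query to a scenario, then build a judge prompt from a scenario-specific checklist. The scenario names, checklist items, and keyword-based router are illustrative assumptions, not the paper's actual implementation (which a real system would likely replace with a classifier and an LLM judge).

```python
# Hedged sketch of query-aware, checklist-based evaluation.
# All scenario names, checklist items, and routing cues are hypothetical.

CHECKLISTS = {
    "homophone": [
        "Does the response resolve the intended word among homophones?",
        "Is the interpretation consistent with the conversational context?",
    ],
    "disfluency": [
        "Does the response ignore stutters and fillers rather than echo them?",
        "Does it address the speaker's underlying intent?",
    ],
    "general": [
        "Is the response helpful and relevant to the spoken query?",
        "Is it phrased naturally for a voice conversation?",
    ],
}

# Toy scenario detector via surface cues; purely for illustration.
KEYWORD_ROUTER = {
    "homophone": ("flour", "flower", "their", "there"),
    "disfluency": ("uh", "um", "i-i", "y-you"),
}

def route_scenario(query: str) -> str:
    """Pick the checklist whose cues appear in the (transcribed) query."""
    q = query.lower()
    for scenario, cues in KEYWORD_ROUTER.items():
        if any(cue in q.split() or cue in q for cue in cues):
            return scenario
    return "general"

def build_judge_prompt(query: str, response: str) -> str:
    """Assemble the prompt an automatic judge would score against."""
    scenario = route_scenario(query)
    checks = "\n".join(f"- {c}" for c in CHECKLISTS[scenario])
    return (
        f"[Scenario: {scenario}]\n"
        f"User query (transcribed speech): {query}\n"
        f"Model response: {response}\n"
        f"Rate the response against this checklist:\n{checks}"
    )
```

The key design point is that the checklist varies with the query, so a response to a disfluent utterance is judged on intent recovery rather than on generic helpfulness alone.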
Jian Zhang
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Linhao Zhang
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Bokai Lei
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Chuhan Wu
WeChat AI, Tencent
Wei Jia
Pattern Recognition Center, WeChat AI, Tencent Inc, China
Xiao Zhou
M.Phil student at HKUST