VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Speech Large Language Models (Speech-LLMs) exhibit insufficient robustness to disfluent speech—such as that caused by Parkinson’s disease—limiting their deployment in real-world interactive applications. To address this, we introduce VocalBench-DF, the first multidimensional benchmark explicitly designed for evaluating Speech-LLMs on disfluent speech, covering phoneme-level recognition, multidimensional semantic classification, long-context understanding, and component-wise ablation analysis. Using VocalBench-DF, we systematically evaluate 22 state-of-the-art Speech-LLMs, identifying two critical bottlenecks: phoneme-level recognition bias and failure in long-range reasoning. Experiments reveal substantial performance degradation on disfluent speech across all models. Further analysis demonstrates that enhancing robust speech recognition and structured reasoning capabilities significantly improves overall adaptability. This work establishes a reproducible evaluation paradigm and actionable improvement pathways for clinical adaptation and accessible human–AI interaction with Speech-LLMs.

📝 Abstract
While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson's disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capabilities at the component and pipeline levels can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Speech-LLM robustness to speech disfluency
Assessing performance degradation with impaired speech inputs
Identifying bottlenecks in phoneme processing and context modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed VocalBench-DF framework for disfluency evaluation
Identified phoneme-level processing as key performance bottleneck
Enhanced robustness through component and pipeline improvements
🔎 Similar Papers
No similar papers found.
👥 Authors
Hongcheng Liu — Shanghai Jiao Tong University
Yixuan Hou — Shanghai Jiao Tong University
Heyang Liu — Shanghai Jiao Tong University (ASR, Multimodal understanding)
Yuhao Wang — Shanghai Jiao Tong University
Yanfeng Wang — Shanghai Jiao Tong University
Yu Wang — Shanghai Jiao Tong University