🤖 AI Summary
Existing end-to-end spoken language models (SLMs) lack rigorous evaluation of knowledge understanding in realistic speech interactions. Method: We introduce VoxEval, the first purely speech-based question-answering benchmark, featuring diverse speaker identities, realistic noise conditions, variable speaking rates, and spoken mathematical reasoning, all evaluated under a full-speech input–output paradigm. The benchmark comprises a multi-condition audio QA dataset built from both real and synthetic speech, augmented with acoustic perturbations (sketched below), and analyzed along two dimensions: acoustic robustness and semantic complexity. Contribution/Results: Experiments reveal that state-of-the-art SLMs achieve only ~40% average accuracy on VoxEval, dropping below 25% on spoken mathematical reasoning tasks. These results expose critical bottlenecks in robust speech representation learning and joint speech–semantic reasoning, and highlight key directions for advancing SLMs toward reliable real-world deployment.
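As a concrete illustration of the multi-condition setup, the sketch below applies two of the perturbation types the summary names, additive noise at a fixed SNR and a speaking-rate change, using torchaudio. This is a minimal sketch, not the authors' pipeline; the file names, SNR, and rate factor are placeholder values.

```python
# Minimal sketch of VoxEval-style acoustic perturbations (not the paper's code).
import torch
import torchaudio
import torchaudio.functional as F

# Hypothetical clean spoken question; replace with a real file.
waveform, sr = torchaudio.load("question_clean.wav")

# 1) Noise condition: mix in white noise at a chosen signal-to-noise ratio (dB).
noise = torch.randn_like(waveform)
noisy = F.add_noise(waveform, noise, snr=torch.tensor([10.0]))  # 10 dB SNR

# 2) Speaking-rate condition: resample down, then interpret the result at the
#    original rate, which plays the speech faster (naive speed-up; pitch shifts too).
rate_factor = 1.25  # 25% faster speech
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=int(sr / rate_factor))
faster = resampler(waveform)

torchaudio.save("question_noisy.wav", noisy, sr)
torchaudio.save("question_fast.wav", faster, sr)
```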
📝 Abstract
With the growing demand for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. To converse with humans, these models must comprehend a wide range of world knowledge. In this paper, we introduce VoxEval, a novel speech question-answering benchmark specifically designed to assess SLMs' knowledge understanding through purely speech-based interactions. Unlike existing AudioQA benchmarks, VoxEval keeps both questions and answers in speech form, evaluates model robustness across diverse audio conditions (varying timbres, audio qualities, and speaking styles), and pioneers the assessment of challenging domains such as mathematical problem-solving in spoken form. Our comprehensive evaluation of recent SLMs on VoxEval reveals significant performance limitations in current models, highlighting crucial areas for future improvement.
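To make the purely speech-based protocol concrete, here is a minimal sketch of a full-speech evaluation loop. `spoken_lm_answer` is a hypothetical stand-in for whatever speech-in/speech-out interface a given SLM exposes (the paper does not prescribe one), and Whisper is used here only as one convenient way to transcribe the model's spoken answer for scoring.

```python
# Hedged sketch of a speech-in/speech-out QA evaluation loop (assumed interface,
# not VoxEval's official harness).
import whisper

asr = whisper.load_model("base")  # ASR used only to score spoken answers


def spoken_lm_answer(question_wav: str) -> str:
    """Placeholder: run an end-to-end SLM on a spoken question and return
    the path of its spoken answer. Replace with a real model's interface."""
    raise NotImplementedError


def evaluate(items):
    """items: list of (question_wav_path, reference_answer_text) pairs."""
    correct = 0
    for question_wav, reference in items:
        answer_wav = spoken_lm_answer(question_wav)
        transcript = asr.transcribe(answer_wav)["text"].strip().lower()
        correct += reference.lower() in transcript  # simple containment match
    return correct / len(items)
```

The containment match is deliberately crude; a real harness would need a more careful answer-matching rule, but the loop shows the key point: no text enters or leaves the model under evaluation.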