🤖 AI Summary
This paper addresses the “intelligence degradation” problem in speech-to-speech large language models (Speech LLMs), wherein model reasoning and generation capabilities substantially deteriorate under audio inputs. To this end, we introduce S2SBench—the first dedicated evaluation benchmark for Speech LLMs. Methodologically, we propose a pairwise perplexity assessment protocol grounded in plausibility comparison, design diagnostic audio-semantic tasks covering sentence continuation and commonsense reasoning, and characterize degradation trajectories via speech token modeling and training dynamics analysis. Our experiments provide the first systematic quantification of intelligence degradation across training stages of Baichuan-Audio, uncovering consistent performance decay patterns induced by speech input. All datasets and evaluation code are publicly released, establishing foundational infrastructure for trustworthy, standardized evaluation of speech-based LLMs.
📝 Abstract
End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
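The pairwise perplexity protocol can be illustrated with a minimal sketch. The assumption here (not spelled out in the abstract) is that for each diagnostic pair the model's per-token log-probabilities are scored on both the plausible and implausible continuation, and the model is credited when the plausible one attains lower perplexity; the function names are hypothetical, not from the S2SBench codebase.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability
    over the target tokens of one sample."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pairwise_accuracy(pairs):
    """Fraction of (plausible, implausible) pairs where the plausible
    sample scores strictly lower perplexity. Assumed scoring rule:
    lower perplexity on the plausible continuation counts as correct."""
    wins = sum(1 for plaus, implaus in pairs
               if perplexity(plaus) < perplexity(implaus))
    return wins / len(pairs)

# Toy example: per-token log-probs for two diagnostic pairs.
pairs = [
    ([-0.2, -0.3, -0.1], [-1.5, -2.0, -1.8]),  # model prefers plausible
    ([-0.4, -0.5],       [-0.9, -1.2]),        # model prefers plausible
]
print(pairwise_accuracy(pairs))
```

Running the same protocol on text input and on the corresponding audio input, then comparing the two accuracies, gives a direct measure of the degradation the paper describes.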