🤖 AI Summary
This paper addresses the “intelligence degradation” problem in speech-to-speech large language models (Speech LLMs), wherein model reasoning and generation capabilities substantially deteriorate under audio inputs. To this end, we introduce S2SBench—the first dedicated evaluation benchmark for Speech LLMs. Methodologically, we propose a pairwise perplexity assessment protocol grounded in plausibility comparison, design diagnostic audio-semantic tasks covering sentence continuation and commonsense reasoning, and characterize degradation trajectories via speech token modeling and training dynamics analysis. Our experiments provide the first systematic quantification of intelligence degradation across training stages of Baichuan-Audio, uncovering consistent performance decay patterns induced by speech input. All datasets and evaluation code are publicly released, establishing foundational infrastructure for trustworthy, standardized evaluation of speech-based LLMs.
📝 Abstract
End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
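The pairwise perplexity protocol can be illustrated with a minimal sketch. The assumption here (not spelled out in the abstract) is that for each diagnostic pair the model's per-token log-probabilities are scored on both the plausible and implausible continuation, and the model is credited when the plausible one attains lower perplexity; the function names are hypothetical, not from the S2SBench codebase.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability
    over the target tokens of one sample."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pairwise_accuracy(pairs):
    """Fraction of (plausible, implausible) pairs where the plausible
    sample scores strictly lower perplexity. Assumed scoring rule:
    lower perplexity on the plausible continuation counts as correct."""
    wins = sum(1 for plaus, implaus in pairs
               if perplexity(plaus) < perplexity(implaus))
    return wins / len(pairs)

# Toy example: per-token log-probs for two diagnostic pairs.
pairs = [
    ([-0.2, -0.3, -0.1], [-1.5, -2.0, -1.8]),  # model prefers plausible
    ([-0.4, -0.5],       [-0.9, -1.2]),        # model prefers plausible
]
print(pairwise_accuracy(pairs))
```

Running the same protocol on text input and on the corresponding audio input, then comparing the two accuracies, gives a direct measure of the degradation the paper describes.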