Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the root causes of the performance gap between speech and text inputs in end-to-end spoken large language models. Through cross-layer centered kernel alignment (CKA) analysis combined with speech–text token alignment, evaluated on SpeechMMLU and VoiceBench BBH across four open-source models, the authors find that speech representations form broad alignment bands across layers. This suggests that the modality gap stems primarily from the difficulty of compressing redundant acoustic information into stable high-level semantic representations, rather than from a mere distributional shift. Furthermore, statistical calibration at the input layer proves ineffective or even detrimental, reinforcing the structural stability of the observed alignment patterns. These findings provide theoretical grounding for future modeling approaches that operate at the token or temporal granularity.

πŸ“ Abstract
Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains between speech-based input and direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer by layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech–text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech, in which semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
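The abstract's core measurement tool, centered kernel alignment (CKA), compares representations of the same inputs across layers or modalities. A minimal sketch of linear CKA is below; the function names and the toy token-pairing setup are illustrative assumptions, not the paper's actual analysis code.

```python
import numpy as np

def center_gram(gram: np.ndarray) -> np.ndarray:
    """Double-center a Gram matrix: H G H with H = I - (1/n) 11^T."""
    n = gram.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ gram @ h

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between paired representations.

    x: (n_tokens, d1), y: (n_tokens, d2) — row i of x and row i of y
    must describe the same (aligned) token, e.g. a speech frame span
    mapped to its text token.  Returns a similarity in [0, 1].
    """
    gx = center_gram(x @ x.T)
    gy = center_gram(y @ y.T)
    hsic = (gx * gy).sum()                       # tr(Gx Gy), unnormalized HSIC
    return float(hsic / (np.linalg.norm(gx) * np.linalg.norm(gy)))
```

A cross-layer analysis like the paper's would evaluate `linear_cka` for every (speech layer, text layer) pair and inspect the resulting similarity matrix; a "broad alignment band" corresponds to one speech layer scoring highly against a wide range of text layers. Note that linear CKA is invariant to orthogonal transforms of either representation, which is why it suits comparing spaces of different dimensionality.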
Problem

Research questions and friction points this paper is trying to address.

modality gap
speech-language models
speech representation
text inference
end-to-end models
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality gap
speech-language models
cross-layer alignment
CKA analysis
temporal redundancy