Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the "modality gap"—the significant performance disparity between speech and text inputs—in Large Speech-Language Models (LSLMs). Because existing analyses of alignment mechanisms lack fine-grained characterization, we conduct the first systematic empirical analysis and propose a representation-level quantitative metric, the Alignment Path Score. We further design a token-level representation intervention method based on angular projection and length normalization. Using cosine similarity, Euclidean distance, hierarchical representation analysis, and alignment path modeling, we reveal a strong correlation between the modality gap and representation similarity. Experiments demonstrate that our approach substantially narrows the performance gap under speech input, improving model accuracy across multiple downstream tasks. Our work establishes an interpretable and intervenable paradigm for speech–text cross-modal alignment, advancing both diagnostic understanding and controllable representation learning in LSLMs.

📝 Abstract
End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.
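The coarse-grained finding above—deeper-layer speech and text representations aligning in direction (cosine similarity) while diverging in magnitude (Euclidean distance)—can be sketched with a small per-layer comparison. This is an illustrative reconstruction, not the paper's code; the pooling scheme and array shapes (`layers × dim` mean-pooled hidden states for the same utterance) are assumptions.

```python
import numpy as np

def coarse_grained_gap(text_states, speech_states):
    """Compare paired text and speech representations layer by layer.

    text_states, speech_states: arrays of shape (layers, dim), e.g.
    mean-pooled hidden states of the same utterance under each modality
    (hypothetical setup for illustration).
    Returns per-layer (cosine similarity, Euclidean distance):
    direction alignment vs. magnitude divergence.
    """
    dots = np.sum(text_states * speech_states, axis=-1)
    norms = (np.linalg.norm(text_states, axis=-1)
             * np.linalg.norm(speech_states, axis=-1))
    cosine = dots / norms                                   # directional alignment
    distance = np.linalg.norm(text_states - speech_states,  # magnitude gap
                              axis=-1)
    return cosine, distance
```

Applied across layers, the paper's observation corresponds to `cosine` trending upward with depth while `distance` also grows—i.e., the two modalities point the same way but at different scales.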
Problem

Research questions and friction points this paper is trying to address.

Analyzing performance gap between speech and text inputs in large language models
Investigating representation alignment patterns across different model layers
Developing interventions to improve speech input correctness through alignment mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing coarse- and fine-grained speech-text representations
Introducing Alignment Path Score to quantify token-level alignment
Applying angle projection and length normalization interventions
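The intervention idea above can be sketched as a token-level operation that nudges a speech token's direction toward its aligned text token and rescales its length. This is a minimal sketch under assumptions—the paper's exact projection and normalization formulas are not given here, and the blend factor `alpha` is a hypothetical parameter.

```python
import numpy as np

def intervene_token(speech_vec, text_vec, alpha=0.5):
    """Illustrative angle-projection + length-normalization intervention
    (not the paper's exact formulation).

    Interpolates the speech token's direction toward the text token's
    direction by factor alpha in [0, 1], then rescales the result to the
    text token's norm, correcting both angle and magnitude mismatch.
    """
    s_dir = speech_vec / np.linalg.norm(speech_vec)   # unit speech direction
    t_dir = text_vec / np.linalg.norm(text_vec)       # unit text direction
    blended = (1.0 - alpha) * s_dir + alpha * t_dir   # angular interpolation
    blended /= np.linalg.norm(blended)
    return blended * np.linalg.norm(text_vec)         # length normalization
```

With `alpha = 1` the speech token is mapped onto the text token exactly; intermediate values preserve some speech-specific direction while closing the magnitude gap, which matches the paper's goal of targeted rather than wholesale correction.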
👤 Authors
Bajian Xiang (Beike Inc., Beijing, China)
Shuaijiang Zhao
Tingwei Guo (Beike Inc., Beijing, China)
Wei Zou

Topics: Speech · NLP · LLM · Multimodal