🤖 AI Summary
Spoken Language Models (SLMs) suffer from a substantial modality gap between speech and text representations and often generalize poorly across datasets. To address this, the authors propose Optimal Transport Regularization (OTReg), which formulates speech–text embedding alignment as an optimal transport problem. In each training iteration, OTReg computes an optimal transport plan that establishes structured correspondences between speech and transcript embeddings, then applies a regularization loss based on that plan to pull the modalities closer. The method is lightweight—requiring no additional labels or learnable parameters—and integrates seamlessly into existing SLM training procedures. In multilingual ASR experiments, OTReg improves speech–text alignment, reducing average word error rate by 2.1%, and enhances out-of-domain generalization by up to 14.3%, demonstrating its effectiveness in mitigating the speech–text modality gap.
📝 Abstract
Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
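The two-step mechanism the abstract describes—solve for a transport plan between speech and transcript embeddings, then use it to weight an alignment loss—can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the cosine-distance cost, the entropy-regularized Sinkhorn solver, the uniform marginals, and the function names `sinkhorn_plan` and `otreg_loss` are all assumptions made for the sketch.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan between uniform
    marginals via Sinkhorn iterations (illustrative solver; the
    paper's exact OT formulation may differ)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # uniform mass over speech frames
    b = np.full(m, 1.0 / m)        # uniform mass over transcript tokens
    K = np.exp(-cost / reg)        # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)          # scale columns to match b
        u = a / (K @ v)            # scale rows to match a
    return u[:, None] * K * v[None, :]

def otreg_loss(speech_emb, text_emb):
    """Hypothetical OTReg-style regularizer: transport-plan-weighted
    cosine distance between speech and transcript embeddings."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - s @ t.T           # pairwise cosine distance
    plan = sinkhorn_plan(cost)     # structured speech-text correspondence
    return float(np.sum(plan * cost))

# Toy example: 6 speech-frame embeddings vs 4 transcript-token
# embeddings, dimension 8; in an SLM these come from the encoders.
rng = np.random.default_rng(0)
loss = otreg_loss(rng.normal(size=(6, 8)), rng.normal(size=(4, 8)))
print(f"OTReg loss: {loss:.4f}")
```

During actual SLM training this scalar would be added to the task loss and backpropagated through the speech encoder, so that speech embeddings drift toward the transcript embeddings the plan pairs them with.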