🤖 AI Summary
Spoken Language Models (SLMs) suffer from a substantial modality gap between speech and text representations and often generalize poorly across datasets. To address this, the authors propose Optimal Transport Regularization (OTReg), which formulates speech–text embedding alignment as an optimal transport problem. In each training iteration, OTReg computes an optimal transport plan that establishes structured correspondences between speech and transcript embeddings, then applies a regularization loss based on that plan to pull the modalities closer. The method is lightweight—requiring no additional labels or learnable parameters—and integrates seamlessly into existing SLM training procedures. In multilingual ASR experiments, OTReg improves speech–text alignment, reducing average word error rate by 2.1%, and enhances out-of-domain generalization by up to 14.3%, demonstrating its effectiveness in mitigating the speech–text modality gap.
📝 Abstract
Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
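The two-step mechanism the abstract describes—solve for a transport plan between speech and transcript embeddings, then use it to weight an alignment loss—can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the cosine-distance cost, the entropy-regularized Sinkhorn solver, the uniform marginals, and the function names `sinkhorn_plan` and `otreg_loss` are all assumptions made for the sketch.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport plan between uniform
    marginals via Sinkhorn iterations (illustrative solver; the
    paper's exact OT formulation may differ)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # uniform mass over speech frames
    b = np.full(m, 1.0 / m)        # uniform mass over transcript tokens
    K = np.exp(-cost / reg)        # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)          # scale columns to match b
        u = a / (K @ v)            # scale rows to match a
    return u[:, None] * K * v[None, :]

def otreg_loss(speech_emb, text_emb):
    """Hypothetical OTReg-style regularizer: transport-plan-weighted
    cosine distance between speech and transcript embeddings."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - s @ t.T           # pairwise cosine distance
    plan = sinkhorn_plan(cost)     # structured speech-text correspondence
    return float(np.sum(plan * cost))

# Toy example: 6 speech-frame embeddings vs 4 transcript-token
# embeddings, dimension 8; in an SLM these come from the encoders.
rng = np.random.default_rng(0)
loss = otreg_loss(rng.normal(size=(6, 8)), rng.normal(size=(4, 8)))
print(f"OTReg loss: {loss:.4f}")
```

During actual SLM training this scalar would be added to the task loss and backpropagated through the speech encoder, so that speech embeddings drift toward the transcript embeddings the plan pairs them with.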