Closing the Modality Reasoning Gap for Speech Large Language Models

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
This work addresses the significant gap in reasoning performance of speech large language models (SLMs) between spoken and textual inputs, a phenomenon termed the modality reasoning gap. To bridge this gap, the authors propose the TARS framework, which introduces a dual-signal reinforcement learning mechanism that jointly enforces representation alignment and behavioral alignment. By leveraging asymmetric rewards, TARS guides the model to produce consistent reasoning trajectories under both speech and text conditions. Built on a Transformer-based SLM architecture, the method combines layer-wise hidden-state similarity metrics with semantic-consistency evaluation. Experiments on reasoning benchmarks such as MMSU and OBQA demonstrate that TARS substantially narrows the modality gap and achieves state-of-the-art performance among 7B-scale models.

πŸ“ Abstract
Although speech large language models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
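The abstract names two dense reward signals: representation alignment (layer-wise hidden-state similarity between speech- and text-conditioned trajectories) and behavior alignment (semantic consistency with reference text completions), combined asymmetrically so the text-conditioned rollout serves as a fixed reference. A minimal sketch of how such a reward might be assembled; the function names, the layer-averaging scheme, and the weights `alpha`/`beta` are illustrative assumptions, not the paper's actual formulation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two flat hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def representation_alignment(speech_states, text_states):
    """Mean layer-wise cosine similarity between hidden states of the
    speech-conditioned and text-conditioned trajectories (one vector per layer)."""
    sims = [cosine(s, t) for s, t in zip(speech_states, text_states)]
    return sum(sims) / len(sims)

def asymmetric_reward(speech_states, text_states, behavior_score, task_reward,
                      alpha=0.5, beta=0.5):
    """Asymmetric reward for the speech-conditioned rollout only: the
    text-conditioned trajectory is treated as a frozen reference, and the
    alignment bonuses are added on top of the task reward.

    behavior_score: an external semantic-consistency score (e.g. from a judge
    model) between the generated output and the reference text completion."""
    rep = representation_alignment(speech_states, text_states)
    return task_reward + alpha * rep + beta * behavior_score
```

In this sketch the text-conditioned side contributes only reference hidden states and completions; gradients would flow through the speech-conditioned rollout alone, matching the asymmetric design described in the abstract.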
Problem

Research questions and friction points this paper is trying to address.

modality reasoning gap
speech large language models
reasoning performance
representational drift
long-chain reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

modality reasoning gap
speech large language models
reinforcement learning
representation alignment
behavior alignment