Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

career value

204K/year

🤖 AI Summary

To address the low efficiency and poor alignment quality in speech-text joint decoding for speech-language models, this paper proposes Early-Stopping Interleaved (ESI) decoding: dynamically truncating redundant speech tokens during interleaved generation to jointly optimize inference speed and alignment accuracy. The method builds upon a unified speech-text joint modeling architecture, employs a speech tokenizer with a shared vocabulary, and leverages a high-quality speech question-answering (QA) fine-tuning dataset. Experiments show that ESI achieves over 40% speedup compared to standard interleaved decoding, with marginal improvements in word error rate (WER) and ASR-QA accuracy; it attains state-of-the-art performance on speech QA and reduces alignment error by 12%. The core contribution is the first integration of an early-stopping mechanism into speech-text interleaved decoding—effectively balancing decoding latency, alignment fidelity, and end-to-end task performance.

Technology Category

Application Category

📝 Abstract

Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies-including the interleaved, and parallel generation paradigms-under a controlled experimental setup using the same base language model, speech tokenizer and training data. Our results show that the interleaved approach achieves the best alignment. However it suffers from slow inference due to long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.

Problem

Research questions and friction points this paper is trying to address.

Comparing speech-text joint decoding paradigms for efficiency and alignment

Addressing slow inference in interleaved speech-text decoding approaches

Enhancing speech QA performance with curated high-quality datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved speech-text decoding for best alignment

Early-stop interleaved pattern for faster decoding

High-quality QA datasets for improved performance

🔎 Similar Papers

SSR: Alignment-Aware Modality Connector for Speech Language Models