VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high latency and exposure bias of autoregressive speech large language models, whose strictly sequential generation hinders real-time streaming interaction. To overcome these limitations, the authors propose a non-autoregressive streaming speech language model based on Masked Diffusion Modeling (MDM). The approach introduces a hierarchical block-wise masking mechanism to align training and inference dynamics, and iterative self-distillation to compress multi-step refinement into few-step inference. Trained on only 6K hours of data, the model achieves a 3.7–10× decoding speedup and a 34% reduction in first-chunk latency while preserving speech recognition accuracy, text quality, and the naturalness of generated audio.
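The paper does not release training code here, but the hierarchical block-wise masking idea can be sketched: during training, tokens are split into blocks; blocks already decoded stay fully visible, the current block is partially masked at a random ratio, and all future blocks are fully masked, mirroring the states seen during block-wise diffusion decoding. The `MASK` token id, block layout, and masking schedule below are illustrative assumptions, not the authors' implementation.

```python
import random

MASK = -1  # hypothetical mask-token id (assumption, not from the paper)

def hierarchical_block_mask(tokens, block_size, cur_block=None):
    """Toy sketch of hierarchical block-wise masking.

    Blocks before `cur_block` stay visible, the current block is
    partially masked at a random ratio, and later blocks are fully
    masked -- matching the progressive states of block-wise decoding.
    """
    n_blocks = (len(tokens) + block_size - 1) // block_size
    if cur_block is None:
        cur_block = random.randrange(n_blocks)  # sample a training state
    ratio = random.random()  # mask ratio for the in-progress block
    out = list(tokens)
    for i in range(len(tokens)):
        b = i // block_size
        if b > cur_block:
            out[i] = MASK  # future blocks: fully masked
        elif b == cur_block and random.random() < ratio:
            out[i] = MASK  # current block: partially masked
    return out
```

At inference, the model would then iteratively unmask the current block while conditioning on the fully revealed earlier blocks, which is exactly the state distribution this masking scheme exposes during training.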

📝 Abstract
Recent Speech Large Language Models (LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling (MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on a limited scale of only 6K hours of speech data, VocalNet-MDM achieves a 3.7×–10× decoding speedup and reduces first-chunk latency by 34% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.
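The "compress multi-step refinement into fewer steps" idea behind iterative self-distillation can be illustrated with a toy scalar analogy, under the assumption that one teacher refinement step moves the estimate a fixed fraction toward the clean signal; a distilled step then reproduces the composition of several teacher steps in one application. All functions and constants here are illustrative, not the authors' method.

```python
def teacher_step(x, target, alpha=0.5):
    """One small refinement step: move a fraction toward the clean signal."""
    return x + alpha * (target - x)

def refine(x, target, steps, alpha=0.5):
    """Multi-step teacher trajectory: repeatedly apply the small step."""
    for _ in range(steps):
        x = teacher_step(x, target, alpha)
    return x

def distilled_step(x, target, teacher_steps=4, alpha=0.5):
    """One 'student' step matching `teacher_steps` teacher steps.

    Composing k steps, each contracting the residual by (1 - alpha),
    leaves a residual of (1 - alpha)**k, so the effective single-step
    rate is 1 - (1 - alpha)**k.
    """
    eff = 1 - (1 - alpha) ** teacher_steps
    return x + eff * (target - x)
```

In the paper's setting the student is a neural denoiser trained to match the teacher's multi-step output rather than a closed-form rate, but the payoff is the same: fewer iterations per block, hence lower first-chunk latency.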
Problem

Research questions and friction points this paper is trying to address.

Speech LLM
autoregressive paradigm
streaming speech interaction
generation efficiency
exposure bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Diffusion Modeling
Non-autoregressive Speech LLM
Hierarchical Block-wise Masking
Iterative Self-Distillation
Streaming Speech Generation