🤖 AI Summary
This work addresses the high latency and exposure bias inherent in autoregressive speech large language models, whose inherently sequential generation hinders real-time streaming interaction. To overcome these limitations, the authors propose a non-autoregressive streaming speech language model based on Masked Diffusion Modeling (MDM). The approach introduces a novel hierarchical block masking mechanism to align training and inference dynamics, and incorporates iterative self-distillation to compress multi-step refinement into few-step inference. Trained on only 6K hours of data, the model achieves a 3.7–10× decoding speedup and a 34% reduction in first-chunk latency while preserving high speech recognition accuracy, text quality, and naturalness of generated audio.
📝 Abstract
Recent Speech Large Language Models~(LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling~(MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on only 6K hours of speech data, VocalNet-MDM achieves a 3.7$\times$--10$\times$ decoding speedup and reduces first-chunk latency by 34\% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.
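To make the block diffusion decoding idea concrete, here is a minimal toy sketch (not the paper's implementation): tokens are emitted block by block to enable streaming, and within each block a masked-diffusion loop repeatedly predicts all masked positions and commits the most confident predictions first. The `toy_model` function, the mask token id, and all sizes are illustrative assumptions, standing in for the actual speech LLM.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token


def toy_model(tokens):
    """Stand-in for the speech LLM: returns {position: (predicted_token,
    confidence)} for every currently masked position. Purely illustrative."""
    return {i: (random.randrange(100), random.random())
            for i, t in enumerate(tokens) if t == MASK}


def block_diffusion_decode(prompt, block_size=4, num_blocks=3, steps=2):
    """Block-wise masked-diffusion decoding sketch: append one fully masked
    block at a time (so earlier blocks can already be streamed out), then run
    `steps` rounds of predict-then-unmask inside the block, committing the
    most confident predictions first."""
    tokens = list(prompt)
    for _ in range(num_blocks):
        start = len(tokens)
        tokens += [MASK] * block_size  # new block starts fully masked
        for step in range(steps):
            preds = toy_model(tokens)
            # Unmask a growing fraction of the remaining masked positions,
            # highest-confidence first; the last step unmasks everything left.
            ranked = sorted(preds.items(), key=lambda kv: -kv[1][1])
            k = max(1, len(ranked) // (steps - step))
            for pos, (tok, _conf) in ranked[:k]:
                tokens[pos] = tok
        # Safety net: finalize any position still masked in this block.
        for i in range(start, len(tokens)):
            if tokens[i] == MASK:
                tokens[i] = toy_model(tokens)[i][0]
    return tokens


out = block_diffusion_decode([1, 2, 3])
```

Fewer refinement `steps` per block is what the paper's iterative self-distillation targets: a multi-step teacher is compressed so the student reaches comparable quality with this loop run only a few times.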