dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition

📅 2026-01-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work proposes an efficient diffusion-based large language model (LLM) framework for automatic speech recognition (ASR). It targets two obstacles: the latency of traditional autoregressive LLM-ASR, whose inference delay grows linearly with sequence length, and the paradigm mismatch that arises when text-oriented discrete diffusion LLMs (dLLMs) are applied directly to ASR, where acoustic conditioning and generation dynamics are incompatible with open-ended text generation. By incorporating ASR-specific priors into the dLLM for the first time, the method enables parallelized transcription through prior-guided adaptive denoising, length-adaptive pruning, and confidence-driven early stopping. The approach matches the accuracy of autoregressive LLM-ASR while accelerating inference by a factor of 4.44, resolving the fundamental paradigm misalignment of dLLMs in ASR tasks.
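The confidence-driven early stopping idea can be sketched as a denoising loop in which tokens whose confidence crosses a threshold are frozen and skip further refinement. This is an illustrative toy, not the paper's implementation: `mock_denoise_step`, the threshold value, and the confidence scores are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def mock_denoise_step(tokens, active):
    """Stand-in for one dLLM denoising step: pretends to refine the
    active positions and returns per-token confidence scores.
    (Hypothetical interface; a real model would score its own outputs.)"""
    conf = rng.uniform(0.5, 1.0, size=len(tokens))
    conf[~active] = 1.0  # frozen tokens stay converged
    return tokens, conf

def denoise_with_early_exit(tokens, max_steps=16, threshold=0.9):
    """Tokens whose confidence exceeds `threshold` are frozen and exit
    the denoising loop early; the loop ends once all tokens converge."""
    active = np.ones(len(tokens), dtype=bool)
    steps_used = 0
    for _ in range(max_steps):
        if not active.any():
            break
        tokens, conf = mock_denoise_step(tokens, active)
        active &= conf < threshold  # freeze converged tokens
        steps_used += 1
    return tokens, steps_used

tokens = np.zeros(12, dtype=int)  # placeholder token ids
_, steps = denoise_with_early_exit(tokens)
print(steps)  # typically far fewer than max_steps
```

Because compute is spent only on still-active positions, the effective number of denoising steps adapts per token rather than being fixed for the whole sequence.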

📝 Abstract
Automatic speech recognition (ASR) systems based on large language models (LLMs) achieve superior performance by leveraging pretrained LLMs as decoders, but their token-by-token generation mechanism leads to inference latency that grows linearly with sequence length. Meanwhile, discrete diffusion large language models (dLLMs) offer a promising alternative, enabling high-quality parallel sequence generation with pretrained decoders. However, directly applying native text-oriented dLLMs to ASR leads to a fundamental mismatch between open-ended text generation and the acoustically conditioned transcription paradigm required by ASR. As a result, it introduces unnecessary difficulty and computational redundancy, such as denoising from pure noise, inflexible generation lengths, and fixed denoising steps. We propose dLLM-ASR, an efficient dLLM-based ASR framework that formulates dLLM's decoding as a prior-guided and adaptive denoising process. It leverages an ASR prior to initialize the denoising process and provide an anchor for sequence length. Building upon this prior, length-adaptive pruning dynamically removes redundant tokens, while confidence-based denoising allows converged tokens to exit the denoising loop early, enabling token-level adaptive computation. Experiments demonstrate that dLLM-ASR achieves recognition accuracy comparable to autoregressive LLM-based ASR systems and delivers a 4.44$\times$ inference speedup, establishing a practical and efficient paradigm for ASR.
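The abstract's prior-guided initialization and length-adaptive pruning can be illustrated as follows: a first-pass ASR hypothesis seeds the denoising canvas and anchors the length, over-provisioned with a few slack slots, and padded positions the model scores as redundant are dropped mid-decoding. Everything here (`PAD`, `slack`, the score threshold) is an assumption for illustration, not the authors' interface.

```python
import numpy as np

PAD = -1  # hypothetical padding id used to over-provision the canvas

def init_from_prior(prior_ids, slack=4):
    """Initialize the denoising canvas from an ASR prior hypothesis
    (e.g. a lightweight first pass), padded with slack positions so
    the dLLM can still lengthen the transcript if needed."""
    return np.concatenate([np.asarray(prior_ids), np.full(slack, PAD)])

def length_adaptive_prune(canvas, keep_score, threshold=0.2):
    """Drop padded positions scored as redundant, shrinking the
    sequence (and the per-step compute) on the fly."""
    redundant = (canvas == PAD) & (keep_score < threshold)
    return canvas[~redundant]

canvas = init_from_prior([7, 3, 9], slack=4)            # length 7
scores = np.array([0.9, 0.8, 0.95, 0.1, 0.05, 0.6, 0.15])
pruned = length_adaptive_prune(canvas, scores)
print(len(pruned))  # → 4: three prior tokens plus one kept slack slot
```

Starting from a prior instead of pure noise, and pruning rather than committing to a fixed generation length, is what removes the "denoising from pure noise" and "inflexible generation lengths" redundancies the abstract describes.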
Problem

Research questions and friction points this paper is trying to address.

automatic speech recognition
large language models
discrete diffusion
inference latency
acoustic conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion LLM
ASR prior
length-adaptive pruning
confidence-based denoising
parallel sequence generation