Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models (DLMs) for automatic speech recognition (ASR) must trade decoding speed against accuracy. This work systematically evaluates three decoding strategies (fixed-number, static confidence thresholding, and dynamic confidence thresholding) and proposes negative log-likelihood (NLL)-based uncertainty as a proxy for tracking decoding progress from round to round. The study finds that most ASR tokens reach high confidence early in the diffusion process, so reliable tokens can be committed aggressively and only the difficult ones are left for later rounds. Both threshold-based strategies outperform fixed-number decoding in accuracy and speed, and the static-threshold strategy matches the accuracy of autoregressive decoding while decoding more efficiently.
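One way to picture the NLL-based uncertainty proxy is as the mean negative log-probability of the tokens the model currently predicts, computed once per decoding round. The sketch below assumes that definition; the paper's exact formulation is not given in this summary, and the function name, input format, and per-round confidences are all hypothetical.

```python
import math


def nll_uncertainty(token_probs):
    """Mean negative log-likelihood over the current top predictions.

    token_probs: probability assigned to the predicted token at each
    position (hypothetical input format, not taken from the paper).
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    eps = 1e-12  # guard against log(0)
    total = sum(math.log(max(p, eps)) for p in token_probs)
    return -total / len(token_probs)


# Invented confidences for a 4-token utterance over three rounds.
rounds = [
    [0.95, 0.40, 0.90, 0.30],
    [0.97, 0.70, 0.95, 0.60],
    [0.99, 0.95, 0.98, 0.92],
]
uncertainty_per_round = [nll_uncertainty(r) for r in rounds]
```

Under this assumed definition, a falling value across rounds indicates that the model is growing more confident in its hypothesis, which is what makes it usable as a progress signal when no reference transcript is available.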
📝 Abstract
While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
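The three schemes differ only in how they decide which masked positions to commit in each round. The following is a minimal sketch under stated assumptions, not the authors' implementation: `toy_predict` stands in for a DLM and its confidences are invented, the threshold rules fall back to the single most confident position so every round makes progress, and the linear decay used for the dynamic threshold is a guess, since the abstract does not say how that threshold changes.

```python
TARGET = list("speech")
# Invented base confidences: four easy tokens, two hard ones.
BASE_CONF = [0.99, 0.98, 0.50, 0.97, 0.60, 0.96]


def toy_predict(tokens):
    """Hypothetical stand-in for a DLM forward pass.

    Returns {position: (token, confidence)} for masked positions;
    confidence grows as more context is committed.
    """
    committed = sum(t is not None for t in tokens)
    return {
        i: (TARGET[i], min(0.999, BASE_CONF[i] + 0.15 * committed))
        for i, t in enumerate(tokens)
        if t is None
    }


def select_fixed(k):
    """Fixed-number: commit the k most confident positions per round."""
    def select(conf, round_idx):
        return sorted(conf, key=conf.get, reverse=True)[:k]
    return select


def select_threshold(tau_for_round):
    """Threshold: commit every position whose confidence >= tau."""
    def select(conf, round_idx):
        tau = tau_for_round(round_idx)
        chosen = [i for i, c in conf.items() if c >= tau]
        # Fallback so decoding cannot stall when nothing clears tau.
        return chosen or [max(conf, key=conf.get)]
    return select


def decode(predict, length, select, max_rounds=50):
    """Iteratively unmask tokens; returns (tokens, rounds used)."""
    tokens = [None] * length
    rounds = 0
    while any(t is None for t in tokens) and rounds < max_rounds:
        preds = predict(tokens)
        conf = {i: c for i, (_, c) in preds.items()}
        for i in select(conf, rounds):
            tokens[i] = preds[i][0]
        rounds += 1
    return tokens, rounds


static = select_threshold(lambda r: 0.9)
# Assumed schedule: start strict, relax by 0.1 per round, floor at 0.6.
dynamic = select_threshold(lambda r: max(0.6, 0.95 - 0.1 * r))

fixed_out, fixed_rounds = decode(toy_predict, 6, select_fixed(1))
static_out, static_rounds = decode(toy_predict, 6, static)
dynamic_out, dynamic_rounds = decode(toy_predict, 6, dynamic)
```

In this toy setup the threshold rules commit all four easy tokens in the first round and leave only the two hard ones for the next, whereas the fixed-number rule spends a full round on each token regardless of confidence. That mirrors the abstract's explanation for why thresholding helps in ASR, although the numbers here are illustrative only.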
Problem

Research questions and friction points this paper is trying to address.

Diffusion Language Models
Automatic Speech Recognition
Decoding Strategies
Confidence Thresholding
Parallel Decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Language Models
Confidence-based Thresholding
Parallel Decoding
Uncertainty Estimation
Automatic Speech Recognition