🤖 AI Summary
Diffusion language models (DLMs) for automatic speech recognition (ASR) trade decoding efficiency against accuracy. This work systematically evaluates three decoding strategies (fixed-iteration, static thresholding, and dynamic thresholding) and proposes negative log-likelihood (NLL) uncertainty as a proxy metric for tracking decoding progress round by round. The study finds that most ASR tokens reach high confidence early in the diffusion process, so confident tokens can be committed aggressively and only the difficult ones are left for later rounds. Both threshold-based strategies outperform fixed-iteration decoding in accuracy and speed. Static thresholding performs best overall, matching the accuracy of autoregressive decoding with superior efficiency.
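As a rough illustration of the NLL-based proxy, the sketch below averages the negative log-likelihood, -log p, over the model's top-1 token probabilities in each decoding round. The per-round probabilities and the choice to average over all positions are assumptions made for this example, not values or definitions taken from the paper.

```python
import math

def mean_nll(confidences):
    """Average negative log-likelihood (-log p) over a list of token probabilities."""
    if not confidences:
        return 0.0
    return sum(-math.log(p) for p in confidences) / len(confidences)

# Hypothetical top-1 probabilities for a 4-token utterance over three rounds.
rounds = [
    [0.40, 0.55, 0.30, 0.60],
    [0.90, 0.95, 0.50, 0.97],
    [0.99, 0.98, 0.93, 0.99],
]

# Lower mean NLL means the model is more certain about its current hypothesis.
per_round = [mean_nll(r) for r in rounds]
for idx, value in enumerate(per_round):
    print(f"round {idx}: mean NLL = {value:.3f}")
```

In this toy trace the mean NLL falls from round to round, which is the kind of signal that could stand in for decoding progress when reference transcripts are not available.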
📝 Abstract
While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
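The three schemes differ in how they pick which masked positions to commit in each round. The following is a minimal sketch of that selection step, not the authors' implementation: the function names, the fallback that commits one token when nothing clears the threshold, the geometric decay used for the dynamic threshold, and the fixed confidence values are all assumptions for illustration. A real DLM would recompute confidences after every round.

```python
def select_fixed(conf, masked, k):
    """Fixed-number: commit the k most confident masked positions."""
    ranked = sorted(masked, key=lambda i: conf[i], reverse=True)
    return ranked[:k]

def select_static(conf, masked, tau):
    """Static threshold: commit every masked position with confidence >= tau.

    Assumed fallback: if none qualify, commit the single most confident
    position so that decoding always makes progress.
    """
    chosen = [i for i in masked if conf[i] >= tau]
    return chosen or select_fixed(conf, masked, 1)

def select_dynamic(conf, masked, round_idx, tau_start=0.95, decay=0.6, tau_min=0.5):
    """Dynamic threshold: the threshold changes with the round index.

    The geometric decay schedule here is hypothetical.
    """
    tau = max(tau_min, tau_start * decay ** round_idx)
    return select_static(conf, masked, tau)

def count_rounds(conf, select):
    """Count rounds needed to commit all positions under a selection rule."""
    masked = list(range(len(conf)))
    rounds = 0
    while masked:
        chosen = set(select(conf, masked, rounds))
        masked = [i for i in masked if i not in chosen]
        rounds += 1
    return rounds

# Toy confidences (held fixed across rounds): six easy tokens, two hard ones.
conf = [0.99, 0.97, 0.96, 0.98, 0.60, 0.96, 0.40, 0.99]

fixed_rounds = count_rounds(conf, lambda c, m, r: select_fixed(c, m, 2))
static_rounds = count_rounds(conf, lambda c, m, r: select_static(c, m, 0.9))
dynamic_rounds = count_rounds(conf, lambda c, m, r: select_dynamic(c, m, r))
print(fixed_rounds, static_rounds, dynamic_rounds)
```

With these toy values, the threshold rules commit all six easy tokens in the first round and spend the remaining rounds on the two hard ones, while the fixed-number rule spreads the easy tokens over several rounds. This mirrors the early-confidence behaviour the abstract describes, under the simplifying assumptions stated above.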