Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion language models for automatic speech recognition (ASR) face a trade-off between decoding efficiency and accuracy. This work systematically evaluates three decoding strategies (fixed-iteration, static thresholding, and dynamic thresholding) and uses negative log-likelihood-based uncertainty as a proxy for tracking decoding progress round by round. The study finds that most ASR tokens reach high confidence early in the diffusion process, so reliable tokens can be committed early and only the difficult ones are left for later rounds. Both threshold-based strategies outperform fixed-iteration decoding in accuracy and speed, and the static-threshold strategy matches the accuracy of autoregressive decoding while being more efficient.
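
To make the uncertainty proxy concrete, here is a minimal sketch of one common reading of negative log-likelihood uncertainty: the NLL of each predicted token under the model's per-position distribution, averaged over the sequence to give one number per decoding round. The function names `token_nll` and `round_uncertainty` are hypothetical, and the paper's exact definition may differ.

```python
import numpy as np

def token_nll(probs, tokens):
    """Per-position negative log-likelihood of the chosen tokens.

    probs:  (length, vocab) array of per-position probabilities.
    tokens: (length,) array of chosen token ids.
    ASSUMPTION: this is an illustrative definition, not the paper's.
    """
    picked = probs[np.arange(len(tokens)), tokens]
    # Clip to avoid log(0) for tokens the model assigns zero probability.
    return -np.log(np.clip(picked, 1e-12, 1.0))

def round_uncertainty(probs, tokens):
    """Mean NLL over the sequence: lower means a more confident round."""
    return float(np.mean(token_nll(probs, tokens)))
```

Tracking `round_uncertainty` across rounds would show the value falling as more positions become confident, which is the kind of progress signal the summary describes.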

📝 Abstract

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
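
The three schemes differ only in how many masked positions are committed per round. The sketch below illustrates that difference under several assumptions: `predict` is a hypothetical callable standing in for the DLM that returns predicted token ids and per-position confidences, the fallback that commits one token when nothing clears the threshold is an addition to guarantee progress, and the linear decay in `dynamic_tau` is an illustrative schedule, not the one used in the paper.

```python
import numpy as np

def select_positions(conf, masked, strategy, k=1, tau=0.9):
    """Return indices of masked positions to commit in this round."""
    candidates = np.flatnonzero(masked)
    if candidates.size == 0:
        return candidates
    if strategy == "fixed":
        # Fixed-number: commit the k most confident masked positions.
        order = np.argsort(-conf[candidates], kind="stable")
        return np.sort(candidates[order[:k]])
    if strategy == "threshold":
        # Threshold: commit every masked position with confidence >= tau.
        chosen = candidates[conf[candidates] >= tau]
        if chosen.size == 0:
            # ASSUMPTION: commit the single best position to ensure progress.
            chosen = candidates[[np.argmax(conf[candidates])]]
        return chosen
    raise ValueError(f"unknown strategy: {strategy}")

def dynamic_tau(round_idx, num_rounds, tau_start=0.95, tau_end=0.6):
    """ASSUMPTION: threshold decays linearly from tau_start to tau_end."""
    frac = round_idx / max(num_rounds - 1, 1)
    return tau_start + (tau_end - tau_start) * frac

def decode(predict, length, strategy, max_rounds=8, k=1, tau=0.9, mask_id=-1):
    """Iteratively unmask a sequence; returns (tokens, rounds_used)."""
    tokens = np.full(length, mask_id, dtype=np.int64)
    masked = np.ones(length, dtype=bool)
    rounds = 0
    while masked.any() and rounds < max_rounds:
        pred, conf = predict(tokens, masked)  # hypothetical model call
        if strategy == "dynamic":
            t = dynamic_tau(rounds, max_rounds)
            pos = select_positions(conf, masked, "threshold", tau=t)
        else:
            pos = select_positions(conf, masked, strategy, k=k, tau=tau)
        tokens[pos] = pred[pos]
        masked[pos] = False
        rounds += 1
    if masked.any():
        # Round budget exhausted: commit whatever is still masked.
        pred, _ = predict(tokens, masked)
        tokens[masked] = pred[masked]
    return tokens, rounds
```

When most positions are already confident, the threshold rule commits them all in one round, while the fixed-number rule spends the same budget per round regardless of confidence. This is consistent with the abstract's explanation of why thresholding is faster on ASR.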

Problem

Research questions and friction points this paper is trying to address.

Diffusion Language Models

Automatic Speech Recognition

Decoding Strategies

Confidence Thresholding

Parallel Decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Language Models

Confidence-based Thresholding

Parallel Decoding