🤖 AI Summary
Existing diffusion-based large language models (DLLMs) suffer from a severe speed–quality trade-off during parallel decoding: acceleration typically degrades generation quality. To address this, we propose a training-free, revocable "wide-in, narrow-out" decoding framework that combines aggressive parallel draft generation with bidirectional-context-aware dynamic verification—detecting erroneous tokens in real time and re-masking them for correction, thereby departing from the conventional irreversible decoding paradigm. Evaluated on open-source DLLMs including LLaDA and MMaDA, our method achieves a 6× speedup on GSM8K mathematical reasoning with a +2.58% accuracy gain, and a 10× speedup on Flickr30K image captioning with improved evaluation metrics. Our core contribution is the first zero-training-cost parallel decoding scheme that simultaneously delivers high-quality outputs, computational efficiency, and robustness against decoding errors.
📝 Abstract
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality–speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs: early errors accumulate in the context and steer subsequent decoding in the wrong direction. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revocable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified on open-source DLLMs such as LLaDA and MMaDA, WINO is shown to decisively improve the quality–speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6× while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10× speedup with higher performance. Further comprehensive experiments demonstrate the superiority of WINO and provide an in-depth understanding of it.
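The draft-and-verify loop described above can be sketched in miniature. This is only an illustrative reconstruction under stated assumptions, not the paper's implementation: `toy_model` is a hypothetical stand-in for a DLLM forward pass (a real model such as LLaDA would be a bidirectional transformer over a masked sequence), and the two confidence thresholds, the freezing rule, and the progress-guaranteeing fallback are illustrative choices of ours rather than WINO's exact criteria.

```python
import random

MASK = None  # sentinel for a masked slot (assumption; real DLLMs use a [MASK] token id)
VOCAB = 10   # toy vocabulary size

def toy_model(tokens):
    """Stand-in for a DLLM forward pass: a deterministic function of the
    whole sequence returning a per-position distribution over VOCAB.
    It 'sees' both left and right context, mimicking bidirectionality."""
    key = sum((t + 1) * (i + 1) for i, t in enumerate(tokens) if t is not MASK)
    rng = random.Random(key)
    probs = []
    for _ in tokens:
        w = [rng.random() for _ in range(VOCAB)]
        s = sum(w)
        probs.append([x / s for x in w])
    return probs

def wino_decode(length=8, draft_thresh=0.18, verify_thresh=0.11, max_steps=64):
    tokens = [MASK] * length
    frozen = [False] * length  # tokens that already survived verification
    for _ in range(max_steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # --- wide-in draft: fill every confident masked slot in parallel ---
        probs = toy_model(tokens)
        for i in masked:
            p = max(probs[i])
            if p > draft_thresh:
                tokens[i] = probs[i].index(p)
        # always commit (and freeze) the single most confident slot,
        # so each step makes guaranteed progress (our assumption)
        best = max(masked, key=lambda i: max(probs[i]))
        tokens[best] = probs[best].index(max(probs[best]))
        frozen[best] = True
        # --- narrow-out verify: re-score drafts under the updated context ---
        probs = toy_model(tokens)
        for i, t in enumerate(tokens):
            if t is MASK or frozen[i]:
                continue
            if probs[i][t] < verify_thresh:
                tokens[i] = MASK   # revoke the suspicious draft for refinement
            else:
                frozen[i] = True   # accept it into the final sequence
    return tokens

print(wino_decode())
```

The key property, which distinguishes this from standard confidence-based parallel decoding, is the re-masking branch: a drafted token that looks implausible once its neighbors are filled in is revoked rather than kept, making the process revocable at zero training cost.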