Roll Out and Roll Back: Diffusion LLMs are Their Own Efficiency Teachers

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the quality degradation that open-source diffusion large language models (DLLMs) suffer during parallel generation, which the authors attribute to a train-inference mismatch amplified by irreversible decoding. They first propose WINO (Wide-In, Narrow-Out), a training-free, revokable decoding algorithm that aggressively drafts multiple tokens in parallel, verifies them against the enriched global context, and re-masks unreliable tokens for later refinement, so that the model discovers a reliable denoising order on its own. Its training-time extension, WINO+, distills the verified denoising trajectories produced by WINO back into the model parameters, aligning training with efficient inference. On GSM8K, WINO raises accuracy from 73.24% to 75.82% with a 6.10× step reduction, and WINO+ reaches 76.58% with a 6.83× reduction; on Flickr30K, WINO+ attains a 16.22× step reduction with improved CIDEr.

📝 Abstract

Diffusion Large Language Models (DLLMs) promise fast parallel generation, yet open-source DLLMs still face a severe quality-speed trade-off: accelerating decoding by revealing multiple tokens often causes substantial quality degradation. We attribute this dilemma to a train-inference mismatch amplified by irreversible decoding. While training reconstructs tokens from randomly corrupted states, efficient inference requires an adaptive denoising order, where easier tokens are revealed earlier and context-dependent ones are deferred. This view motivates two complementary methods: an inference-time method that makes parallel decoding revokable, and a training-time extension that distills the reliable order exposed by this revokable process. Accordingly, we first propose Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable parallel generation. WINO aggressively drafts multiple tokens, verifies generated tokens with enriched global context, and re-masks unreliable ones for later refinement. Building on this discovered order, we further introduce WINO+, which injects the verified denoising trajectories produced by WINO into model parameters, aligning training with efficient inference. Experiments on LLaDA and MMaDA show that WINO improves both quality and efficiency, while WINO+ further strengthens this progression. On GSM8K, WINO improves accuracy from 73.24% to 75.82% with a 6.10x step reduction, and WINO+ further achieves 76.58% with a 6.83x reduction. On Flickr30K, WINO+ reaches a 16.22x step reduction with improved CIDEr. These results demonstrate that DLLMs can serve as their own efficiency teachers by first discovering reliable denoising orders through revokable decoding and then learning to follow them for faster generation. Code is available at https://github.com/Feng-Hong/WINO-DLLM/tree/WINO-plus.
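The draft, verify, and re-mask loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `revocable_decode` and `toy_predict`, the two thresholds `tau_draft` and `tau_verify`, the fallback rule, and the toy predictor standing in for a DLLM are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marker for a position that is still masked

# Hypothetical model interface (an assumption, not the paper's API):
# given a partially revealed sequence, return one (token, confidence)
# guess for every position, including already revealed ones.
Predictor = Callable[[List[Optional[str]]], List[Tuple[str, float]]]


def revocable_decode(
    predict: Predictor,
    length: int,
    tau_draft: float = 0.5,   # lenient threshold for drafting (assumed value)
    tau_verify: float = 0.7,  # stricter threshold for keeping (assumed value)
    max_steps: int = 50,
) -> Tuple[List[Optional[str]], int]:
    seq: List[Optional[str]] = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:
        steps += 1
        guesses = predict(seq)
        masked = [i for i in range(length) if seq[i] is MASK]
        # Draft aggressively: reveal every masked position above tau_draft.
        drafted = [i for i in masked if guesses[i][1] >= tau_draft]
        if not drafted:
            # Fallback (assumption): reveal the single most confident position.
            drafted = [max(masked, key=lambda i: guesses[i][1])]
        for i in drafted:
            seq[i] = guesses[i][0]
        # Verify: re-score revealed tokens with the enriched context and
        # re-mask any token the predictor no longer supports.
        checks = predict(seq)
        for i in range(length):
            if seq[i] is MASK:
                continue
            token, conf = checks[i]
            if token != seq[i] or conf < tau_verify:
                seq[i] = MASK
    return seq, steps


# Toy stand-in for a model: "easy" positions are always predicted correctly,
# the others only once their left neighbour holds the correct token.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]
EASY = {0, 2, 4}


def toy_predict(seq: List[Optional[str]]) -> List[Tuple[str, float]]:
    out = []
    for i, tok in enumerate(TARGET):
        left_ok = i > 0 and seq[i - 1] == TARGET[i - 1]
        if i in EASY or left_ok:
            out.append((tok, 0.9))
        else:
            out.append(("uh", 0.6))  # moderately confident wrong guess
    return out


tokens, steps = revocable_decode(toy_predict, len(TARGET))
print(tokens, steps)
```

In this toy run the first step drafts all six positions, the verification pass re-masks the three context-dependent ones, and the second step fills them in correctly, so decoding finishes in 2 steps instead of 6 one-token steps. Without the re-masking pass, the wrong first-step guesses would have been locked in. Each step here calls the predictor twice, so the step count in this sketch is not a wall-clock measure.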

Problem

Research questions and friction points this paper is trying to address.

Diffusion LLMs

quality-speed trade-off

parallel generation

decoding efficiency

token generation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion LLM

revokable decoding

denoising order