Diffusion Language Models Know the Answer Before Decoding

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models (DLMs) suffer from slow inference due to bidirectional attention and the many iterative refinement steps they require. This work identifies, for the first time, premature convergence during DLM generation—outputs stabilize early, yet unnecessary refinement steps continue. To exploit this, the authors propose Prophet, a training-free, plug-and-play decoding paradigm that reframes decoding efficiency as a dynamic stopping problem. Prophet uses the confidence gap between the top-2 token candidates to assess convergence and decides in real time whether to terminate decoding and commit all remaining tokens in a single step. Evaluated on LLaDA-8B and Dream-7B under both semi-autoregressive and random remasking schedules, Prophet achieves up to 3.4× inference speedup. With only half of the refinement steps, up to 97% of GSM8K instances and 99% of MMLU instances are decoded correctly—matching full-length decoding—demonstrating substantial acceleration without compromising generation quality.

📝 Abstract
Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs: early answer convergence. In many cases, the correct answer can be internally identified by the halfway point of the refinement process, well before the final decoding step, under both semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early-commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations on LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
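The early-commit criterion described in the abstract—stop refining once the gap between the top-2 token probabilities is large at every still-masked position—can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation; the function name, input layout, and threshold value are assumptions.

```python
import math

def should_commit(logits, threshold=0.9):
    """Early-commit test (sketch): go "all-in" and decode all remaining
    tokens in one step when, at every still-masked position, the gap
    between the top-2 token probabilities exceeds a threshold.

    `logits` is a list of per-position score lists over the vocabulary;
    the threshold value is illustrative, not taken from the paper.
    """
    for scores in logits:
        # numerically stable softmax over this position's vocabulary scores
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        probs = sorted((e / z for e in exps), reverse=True)
        if probs[0] - probs[1] < threshold:
            return False  # still ambiguous somewhere: keep refining
    return True  # confident everywhere: commit the current output
```

In an actual decoding loop this check would run once per refinement step, so it adds only a softmax-and-compare over the masked positions—consistent with the paper's claim of negligible overhead.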
Problem

Research questions and friction points this paper is trying to address.

Accelerating diffusion language model inference
Reducing required refinement steps during decoding
Enabling early commit decisions without quality loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Early answer convergence in diffusion models
Training-free fast decoding paradigm
Dynamic refinement with confidence gap criterion