Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding

📅 2026-02-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the significant computational redundancy in Masked Diffusion language models during decoding, where already converged unmasked tokens repeatedly undergo attention and feedforward computations. To mitigate this inefficiency, the authors propose SureLock, a novel token-level computation halting mechanism grounded in posterior stability. Specifically, SureLock monitors the stability of each token's posterior distribution via local KL divergence; once a token is deemed stable, its query projection and feedforward computations are skipped, and its key-value pairs are cached for reuse by other positions. This dynamic strategy adaptively reduces per-iteration computation, achieving 30%–50% FLOPs savings on LLaDA-8B while preserving generation quality comparable to the original sampler.
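The claimed savings follow directly from the cost model in the abstract: a full step costs on the order of $N^2 d$ (every position's query attends over all $N$ positions), while with locking only the $M$ unlocked queries are recomputed against the $N$ cached keys/values, giving $MNd$. A minimal sketch of this accounting (the sequence length, unlocked count, and model dimension below are hypothetical illustration values, not numbers from the paper):

```python
def per_step_attention_flops(n: int, m: int, d: int) -> tuple[int, int]:
    """Rough attention-dominated per-step cost.

    n: sequence length, m: number of unlocked positions, d: model dimension.
    Full recompute: all n queries attend over n positions -> ~ n*n*d.
    With locking: only m unlocked queries attend over n cached K/V -> ~ m*n*d.
    """
    full = n * n * d
    locked = m * n * d
    return full, locked

# Hypothetical step: 1024 tokens, 256 still unlocked, d = 4096.
full, locked = per_step_attention_flops(1024, 256, 4096)
# Savings fraction for this step is 1 - m/n = 0.75; as more tokens
# lock over the iterations, m shrinks and the savings grow.
```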

πŸ“ Abstract
Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step -- even when many unmasked tokens are essentially fixed -- wasting substantial compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position -- thereafter skipping its query projection and feed-forward sublayers -- while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$, where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30--50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock.
Problem

Research questions and friction points this paper is trying to address.

Masked Diffusion Language Models
converged tokens
computation waste
iterative decoding
token locking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked Diffusion Language Model
SureLock
Early Exit
Computational Efficiency
Token Convergence