🤖 AI Summary
Masked Diffusion Models (MDMs) decode by iteratively unmasking tokens, and their quality hinges on ordering heuristics such as confidence-based sampling, which are myopic, accumulate early errors, and cannot exploit extra test-time compute. This work proposes Lookahead Unmasking (LookUM), a lookahead decoding framework that requires no external reward model: it reformulates sampling as path selection over possible unmasking orders, coupling a path generator that proposes candidate paths with a verifier that scores their sequence-level uncertainty and applies importance sampling to select the final paths. Only two to three parallel decoding paths suffice for peak performance, making inference efficient. Evaluated across six benchmarks spanning mathematical reasoning, planning, and code generation, the method consistently outperforms existing baselines. Applied to base LLaDA, it rivals RL-tuned LLaDA 1.5; applied to LLaDA 1.5 itself, it improves performance further, demonstrating both effectiveness and versatility.
📝 Abstract
Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance depends crucially on the inference-time order of unmasking. Prevailing heuristics, such as confidence-based sampling, are myopic: they optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders, without the need for an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to select the final paths. Empirically, erroneous unmasking measurably inflates sequence-level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework on six benchmarks spanning mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to reach peak performance, demonstrating remarkably efficient path selection. The consistent improvements on both LLaDA and post-trained LLaDA 1.5 are particularly striking: base LLaDA with LookUM rivals RL-tuned LLaDA 1.5, while LookUM further enhances LLaDA 1.5 itself, showing that uncertainty-based verification provides benefits orthogonal to reinforcement learning and underscoring the versatility of our framework. Code will be publicly released.
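The generator–verifier coupling described above can be sketched in miniature. The snippet below is a hedged toy illustration, not the paper's implementation: it assumes mean token entropy as the sequence-level uncertainty score and a softmax over negated uncertainties as the importance weights; the names `sequence_uncertainty` and `select_path` are hypothetical.

```python
import math
import random

def sequence_uncertainty(token_probs):
    """Mean token entropy along a path's unmasked positions.
    One simple uncertainty proxy; the paper's exact verifier score may differ."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in token_probs]
    return sum(entropies) / len(entropies)

def select_path(candidate_paths, temperature=1.0, rng=None):
    """Importance-sample one path, favoring low sequence-level uncertainty.

    candidate_paths: list of (path, token_probs) pairs, where token_probs
    holds the per-position probability distributions seen along that path.
    Returns the sampled path and the normalized importance weights.
    """
    rng = rng or random.Random(0)
    # Lower uncertainty -> higher score -> higher importance weight.
    scores = [-sequence_uncertainty(tp) / temperature
              for _, tp in candidate_paths]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    idx = rng.choices(range(len(candidate_paths)), weights=weights, k=1)[0]
    return candidate_paths[idx][0], weights

# Toy candidates: path A unmasks confidently (peaked distributions),
# path B does not (near-uniform distributions).
path_a = ("unmask 3->1->2", [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]])
path_b = ("unmask 1->2->3", [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]])
chosen, w = select_path([path_a, path_b])
```

With only two or three candidates per step, as the paper reports is sufficient, the extra verifier passes add modest overhead relative to a single greedy decode.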