🤖 AI Summary
This work addresses a fundamental trade-off in diffusion language models under stochastic-order decoding: improving the quality of individual samples often suppresses the model’s ability to explore diverse generation paths. The paper offers a unified explanation of this “quality–exploration dilemma” by analyzing how low-confidence re-masking constrains the entropy of the induced sequence distribution. To balance the two competing objectives explicitly, the authors characterize a theoretically optimal target distribution and design an Independent Metropolis–Hastings sampler that efficiently approximates it during decoding. Experiments on reasoning benchmarks including MATH500, AIME24/25, HumanEval, and MBPP show that the proposed approach outperforms both random-order and low-confidence re-masking strategies, achieving a better trade-off between generation quality and exploration diversity.
📝 Abstract
Diffusion large language models (dLLMs) theoretically permit token decoding in arbitrary order, a flexibility that could enable richer exploration of reasoning paths than autoregressive (AR) LLMs. In practice, however, random-order decoding often hurts generation quality. To mitigate this, low-confidence remasking improves single-sample quality (e.g., Pass@$1$) by prioritizing confident tokens, but it also suppresses exploration and limits multi-sample gains (e.g., Pass@$k$), creating a fundamental quality--exploration dilemma. In this paper, we provide a unified explanation of this dilemma. We show that low-confidence remasking improves a myopic proxy for quality while provably constraining the entropy of the induced sequence distribution. To overcome this limitation, we characterize the optimal distribution that explicitly balances quality and exploration, and develop a simple Independent Metropolis--Hastings sampler that approximately targets this distribution during decoding. Experiments across a range of reasoning benchmarks, including MATH500, AIME24/25, HumanEval, and MBPP, show that our approach yields a better quality--exploration tradeoff than both random and low-confidence remasking.
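For readers unfamiliar with the sampler family the abstract names, the following is a minimal, self-contained sketch of *Independent* Metropolis--Hastings on a toy continuous target. It is not the paper's decoding-time sampler (which targets a distribution over token sequences); it only illustrates the defining feature: proposals are drawn from a fixed distribution $q$ independent of the current state, and accepted with probability $\min\{1, \pi(x')q(x)/(\pi(x)q(x'))\}$. The target and proposal densities here are hypothetical choices for the demo.

```python
import math
import random

def independent_mh(log_target, log_proposal, sample_proposal, n_steps, x0):
    """Independent Metropolis-Hastings.

    Proposals x' ~ q are drawn independently of the current state x,
    and accepted with probability min(1, pi(x') q(x) / (pi(x) q(x'))).
    log_target / log_proposal may be unnormalized (constants cancel).
    """
    x = x0
    samples = []
    for _ in range(n_steps):
        x_new = sample_proposal()
        # log of the IMH acceptance ratio
        log_alpha = (log_target(x_new) - log_target(x)) + \
                    (log_proposal(x) - log_proposal(x_new))
        if math.log(random.random()) < log_alpha:
            x = x_new  # accept; otherwise keep the current state
        samples.append(x)
    return samples

# Toy example: target N(2, 1), independent proposal N(0, 2).
def log_target(x):
    return -0.5 * (x - 2.0) ** 2          # up to an additive constant

def log_proposal(x):
    return -0.5 * (x / 2.0) ** 2          # up to an additive constant

random.seed(0)
samples = independent_mh(log_target, log_proposal,
                         lambda: random.gauss(0.0, 2.0),
                         n_steps=20000, x0=0.0)
mean = sum(samples) / len(samples)        # should approach 2.0
```

The key design property, relevant to the paper's use, is that the acceptance ratio only requires evaluating the (unnormalized) target at the proposed and current states, so a chain can approximately target a distribution that is known only up to its normalizing constant.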