Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Standard decoding strategies for masked diffusion language models (MDLMs)—such as confidence-based sampling—tend to degenerate into autoregressive-like token generation, undermining their inherent non-autoregressive modeling advantages. To address this, we propose Reward-Weighted Sampling (RWS), the first method to incorporate an external reward model into diffusion logit reweighting: it dynamically adjusts per-step token selection probabilities based on global sequence quality. We theoretically prove that RWS induces beneficial ranking reversals, thereby increasing the expected reward. RWS combines confidence-aware sampling with reward-driven logit scaling, leveraging intermediate sequence feedback during iterative denoising to refine non-autoregressive decoding. Experiments demonstrate that RWS significantly improves parallelism and fill-in flexibility, yielding consistent gains across BLEU, ROUGE, and generation coherence metrics.

📝 Abstract
Masked diffusion models (MDMs) offer a promising non-autoregressive alternative for large language modeling. Standard decoding methods for MDMs, such as confidence-based sampling, select tokens independently based on individual token confidences at each diffusion step. However, we observe that this independent token selection often results in generation orders resembling sequential autoregressive processes, limiting the advantages of non-autoregressive modeling. To mitigate this phenomenon, we propose Reward-Weighted Sampling (RWS), a novel decoding strategy that leverages an external reward model to provide a principled global signal during the iterative diffusion process. Specifically, at each diffusion step, RWS evaluates the quality of the entire intermediate sequence and scales token logits accordingly, guiding token selection by integrating global sequence-level coherence. This method selectively increases the confidence of tokens that initially have lower scores, thereby promoting a more non-autoregressive generation order. Furthermore, we provide theoretical justification showing that reward-weighted logit scaling induces beneficial rank reversals in token selection and consistently improves expected reward. Experiments demonstrate that RWS significantly promotes non-autoregressive generation orders, leading to improvements across multiple evaluation metrics. These results highlight the effectiveness of integrating global signals in enhancing both the non-autoregressive properties and overall performance of MDMs.
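The abstract describes one RWS denoising step: score the current intermediate sequence with an external reward model, scale the token logits by that reward, then unmask the most confident masked positions. A minimal sketch of that step, assuming NumPy and a scalar reward already computed by some external reward model (the function name `rws_step`, the scaling form `1 + alpha * reward`, and the hyperparameter `alpha` are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rws_step(logits, masked_positions, reward, k, alpha=1.0):
    """One reward-weighted denoising step (sketch).

    logits: (seq_len, vocab_size) model logits for the intermediate sequence
    masked_positions: indices of still-masked tokens
    reward: scalar score of the whole intermediate sequence from a reward model
    k: number of masked positions to unmask at this step
    """
    # Reward-weighted logit scaling: the global sequence reward rescales the
    # logits, which changes each position's peak confidence non-uniformly and
    # can reverse the confidence ranking used for selection.
    scaled = logits * (1.0 + alpha * reward)
    probs = softmax(scaled, axis=-1)
    token_ids = probs.argmax(axis=-1)   # greedy token per position
    confidence = probs.max(axis=-1)     # per-position confidence

    # Unmask the k most confident masked positions this step.
    masked = np.asarray(masked_positions)
    chosen = masked[np.argsort(-confidence[masked])][:k]
    return chosen.tolist(), token_ids[chosen].tolist()
```

In a full decoder this step would run inside the iterative denoising loop, with the reward model re-scoring the sequence after each partial unmasking; positive rewards sharpen the distribution (favoring currently confident tokens), while low or negative rewards flatten it, giving initially low-scoring tokens a chance to be selected earlier, which is the non-autoregressive effect the paper targets.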
Problem

Research questions and friction points this paper is trying to address.

Enhancing non-autoregressive generation in masked diffusion LLMs
Addressing autoregressive-like token selection in diffusion models
Improving global sequence coherence through reward-weighted sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-Weighted Sampling decoding strategy
External reward model guides token selection
Global sequence-level coherence integration