EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of reward-guided training in discrete diffusion language models, where non-differentiable sampling necessitates approximations such as continuous relaxation or straight-through estimators—both of which often introduce gradient distortion or suboptimal optimization. To overcome this limitation, the authors propose an entropy-aware dynamic gradient modulation mechanism that adaptively adjusts the gradient feedback from continuous relaxation based on the model’s predictive confidence. This approach enhances guidance efficacy while preserving semantic plausibility, effectively breaking the traditional trade-off between gradient fidelity and optimization efficiency. Evaluated on a 7B-parameter diffusion language model across three distinct reward models and three multi-skill benchmarks, the method consistently outperforms current state-of-the-art approaches.

📝 Abstract
Reward guidance has been applied with great success in the test-time adaptation of continuous diffusion models; it updates each denoising step using the gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. In this work, we show the downsides of both of these methods. The former degrades gradient feedback because the reward model has never been trained on continuous inputs. The latter involves incorrect optimization because the gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this tradeoff by introducing a novel mechanism called EntRGi: Entropy aware Reward Guidance that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.
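The abstract only sketches the mechanism, so below is a minimal NumPy illustration of one plausible reading of entropy-aware gradient modulation: each sequence position gets a confidence weight (one minus the normalized entropy of its predicted token distribution), and the reward-model gradient applied to the logits is scaled by that weight. The function names, the specific weighting scheme, and the update rule are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def entropy_weight(logits, axis=-1):
    """Per-position confidence weight in [0, 1]:
    near 1 for a peaked (confident) distribution, near 0 for uniform.
    NOTE: this particular weighting (1 - H / H_max) is an assumption."""
    z = logits - logits.max(axis=axis, keepdims=True)   # stable softmax
    p = np.exp(z)
    p /= p.sum(axis=axis, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=axis)       # Shannon entropy
    max_ent = np.log(logits.shape[axis])                # uniform-dist entropy
    return 1.0 - ent / max_ent

def guided_update(logits, reward_grad, step=0.1):
    """One hypothetical guidance step: confident positions receive the
    full reward gradient; uncertain positions are damped toward zero."""
    w = entropy_weight(logits)                          # shape (batch, seq)
    return logits + step * w[..., None] * reward_grad
```

Under this reading, positions where the model is nearly certain of the token are steered strongly by the reward model, while high-entropy positions are left mostly to the diffusion model's own denoising, which matches the stated goal of improving guidance strength without feeding the reward model unreliable inputs.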
Problem

Research questions and friction points this paper is trying to address.

reward guidance
discrete diffusion language models
gradient feedback
continuous relaxation
straight-through estimator
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward guidance
discrete diffusion models
entropy-aware
continuous relaxation
straight-through estimator
Atula Tejaswi
University of Texas at Austin
Deep Learning, Natural Language Processing, Graph Neural Networks, Information Retrieval
Litu Rout
University of Texas at Austin
Machine Learning, Generative Modeling, Sampling, Optimization
C. Caramanis
The University of Texas at Austin
Sanjay Shakkottai
The University of Texas at Austin
S. Sanghavi
The University of Texas at Austin