🤖 AI Summary
This work addresses the challenge of reward guidance in discrete diffusion language models, where non-differentiable sampling necessitates approximations such as continuous relaxation or straight-through estimation, both of which introduce either degraded gradient feedback or incorrect optimization. To overcome this limitation, the authors propose an entropy-aware mechanism that dynamically modulates the gradients flowing through the continuous relaxation based on the model's predictive confidence. This approach improves guidance efficacy while keeping the reward model's inputs reliable, sidestepping the usual trade-off between faithful reward-model inputs and correct gradient-based optimization. Evaluated on a 7B-parameter diffusion language model across three diverse reward models and three multi-skill benchmarks, the method consistently outperforms state-of-the-art approaches.
📝 Abstract
Reward guidance has been applied with great success to the test-time adaptation of continuous diffusion models; it updates each denoising step using gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the model's natural outputs because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations or employ techniques like the straight-through estimator. In this work, we show the downsides of both methods. The former degrades gradient feedback because the reward model has never been trained on continuous inputs. The latter yields incorrect optimization because a gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this trade-off with a novel mechanism called EntRGi (Entropy-aware Reward Guidance) that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation according to the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across three diverse reward models and three multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.
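The abstract does not give the exact form of the modulation, but the core idea (use the model's per-position confidence to decide how discrete the reward model's input should be) can be illustrated with a minimal sketch. The function name `entropy_modulated_relaxation` and the specific blending rule below are assumptions for illustration, not the paper's actual EntRGi algorithm: confident positions are pushed toward one-hot argmax tokens (reliable inputs for a reward model trained on discrete text), while uncertain positions keep the soft relaxation so that gradients remain informative.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy_modulated_relaxation(logits):
    """Hypothetical sketch of entropy-aware modulation (not the paper's
    exact rule): blend the soft relaxation with the one-hot argmax token,
    weighted by per-position confidence = 1 - normalized entropy.

    logits: array of shape (T, V) -- T positions, vocabulary size V.
    Returns an array of shape (T, V) to feed to the reward model.
    """
    probs = softmax(logits)                         # soft relaxation, (T, V)
    V = probs.shape[-1]
    H = -(probs * np.log(probs + 1e-12)).sum(-1)    # per-position entropy
    conf = 1.0 - H / np.log(V)                      # confidence in [0, 1]
    onehot = np.eye(V)[probs.argmax(-1)]            # discrete argmax tokens
    # Confident positions -> near-discrete input; uncertain -> stay soft.
    return conf[:, None] * onehot + (1.0 - conf[:, None]) * probs
```

In an actual guidance loop this relaxed input would be passed to the reward model and the resulting gradient used to update the logits; the point of the sketch is only the confidence-dependent interpolation between discrete and continuous inputs.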