🤖 AI Summary
To address the challenge of precisely aligning masked discrete diffusion models with downstream reward signals, this paper proposes the Stepwise Trajectory Alignment (STA) framework. STA decomposes the reward over the full diffusion trajectory into additive temporal factors, enabling explicit, differentiable alignment between each denoising step and an arbitrary reward function. Unlike conventional reinforcement learning–based reward backpropagation through the entire denoising process, STA optimizes a set of stepwise objectives while guaranteeing an equivalent optimal solution under the additive factorization of the trajectory reward, and it supports end-to-end preference optimization. Empirical evaluation demonstrates consistent improvements across diverse discrete sequence generation tasks: up to a 12% gain in predicted activity on DNA sequence design over the strongest RL-based baseline, and an increase in GSM8K score on LLaDA-8B-Instruct from 78.6 to 80.7. Additional gains are observed on protein inverse folding. These results validate STA's generality and effectiveness for reward-guided discrete diffusion modeling.
📝 Abstract
Discrete diffusion models have demonstrated great promise in modeling various kinds of sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving these models by aligning them with a given reward. In this work, we propose a novel preference optimization method for masked discrete diffusion models through principled diffusion trajectory alignment. Instead of applying the reward to the final output and backpropagating the gradient through the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and, importantly, guarantees an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains, including DNA sequence design, protein inverse folding, and language modeling, consistently demonstrate the superiority of our approach. Notably, it achieves up to a 12% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and improves the GSM8K score of LLaDA-8B-Instruct from 78.6 to 80.7 on language modeling.
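The core idea — replacing one reward signal on the final output with per-step alignment objectives over the denoising trajectory — can be illustrated with a minimal sketch. This is not the paper's exact formulation: it assumes a DPO-style preference loss applied independently at each denoising step, with per-step log-probabilities of a preferred and a dispreferred trajectory under the policy and a frozen reference model; all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Numerically plain logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def stepwise_alignment_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style preference loss decomposed over denoising steps.

    Each input is a length-T array of per-step log-probabilities of the
    denoising action taken at that step: `logp_w`/`logp_l` under the current
    policy for the preferred (w) and dispreferred (l) trajectory, and
    `ref_logp_w`/`ref_logp_l` under a frozen reference model. Because the
    margin sums over steps would factorize additively, averaging the per-step
    losses mirrors an additive factorization of the trajectory reward.
    """
    margins = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    per_step = -np.log(sigmoid(margins))  # one preference loss per step
    return per_step.mean()

# When the policy matches the reference, every margin is zero and the
# loss sits at log(2); favoring the preferred trajectory lowers it.
T = 4
baseline = stepwise_alignment_loss(np.zeros(T), np.zeros(T),
                                   np.zeros(T), np.zeros(T))
improved = stepwise_alignment_loss(np.ones(T), -np.ones(T),
                                   np.zeros(T), np.zeros(T))
print(baseline, improved)
```

In this sketch, gradients flow only through each step's own log-probability rather than through the whole sampled denoising chain, which is the practical appeal of a stepwise decomposition.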