🤖 AI Summary
This work addresses trajectory locking, a failure mode in reward-maximization-based post-training of diffusion language models in which probability mass over-concentrates on a narrow set of denoising paths, reducing coverage of alternative correct solutions. To mitigate it, the authors propose TraFL, which the summary describes as the first approach to integrate trajectory balance into diffusion language model post-training. TraFL trains the policy toward a reward-tilted target distribution anchored to a frozen reference model, using a diffusion-compatible sequence-level surrogate objective and a learned prompt-dependent normalization. On mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, and the gains persist as the sampling budget increases. On held-out evaluations, it stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.
📝 Abstract
Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting, which we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.
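The abstract does not give the objective in closed form, so the following is only a minimal sketch of a generic trajectory-balance residual, under the assumption that the reward-tilted target has the common form p*(y | x) ∝ p_ref(y | x) · exp(r(x, y) / β). The function names, the temperature `beta`, and the argument layout are hypothetical. The sequence-level log-probabilities are taken as given inputs, standing in for whatever diffusion-compatible surrogate the paper actually uses, and `log_z` stands in for the output of the learned prompt-dependent normalization.

```python
def trajectory_balance_loss(log_z, logp_policy, logp_ref, reward, beta=1.0):
    """Squared log-ratio residual for one (prompt, response) pair.

    Assumed target (not taken from the paper):
        p*(y | x) proportional to p_ref(y | x) * exp(reward / beta)
    The residual is zero exactly when
        Z(x) * p_policy(y | x) == p_ref(y | x) * exp(reward / beta).
    """
    residual = log_z + logp_policy - logp_ref - reward / beta
    return residual ** 2


def batch_loss(samples, beta=1.0):
    """Mean loss over tuples (log_z, logp_policy, logp_ref, reward)."""
    total = sum(trajectory_balance_loss(*s, beta=beta) for s in samples)
    return total / len(samples)
```

Unlike a pure reward-maximizing update, a residual of this kind is minimized by matching the whole tilted distribution rather than by pushing mass onto the highest-reward samples, which is the intuition behind using it against trajectory locking.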