🤖 AI Summary
Diffusion-based reinforcement learning (RL) for alignment commonly suffers from reward hacking, manifesting as degraded sample quality, over-stylization, and reduced diversity. To address this, we propose Data-regularized Diffusion Reinforcement Learning (DDRL), a framework that anchors the policy to the offline data distribution via a forward KL divergence constraint, enabling unbiased and robust integration of RL optimization with standard diffusion training. DDRL jointly optimizes reward maximization and diffusion loss minimization, incorporating offline data regularization and scalable RL techniques tailored to high-resolution video generation. Evaluated with over a million GPU hours of experiments and ten thousand double-blind human assessments, DDRL significantly improves rewards and achieves the highest human preference while alleviating the reward hacking seen in baselines, establishing a robust and scalable paradigm for diffusion post-training.
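Concretely, the anchored objective can be read as a reward term plus a data-regularization term. The following is a minimal sketch in illustrative notation (the reward model r, weight β, and loss symbols are assumptions, not necessarily the paper's):

```latex
% Hypothetical sketch of the DDRL objective (notation is illustrative).
% Maximize reward under the policy p_theta while anchoring it to the
% offline data distribution p_data via a forward KL term:
\max_{\theta} \;
  \mathbb{E}_{x \sim p_\theta}\!\left[ r(x) \right]
  \;-\; \beta \, \mathrm{KL}\!\left( p_{\mathrm{data}} \,\|\, p_\theta \right)
% Up to a theta-independent constant, the forward KL equals the negative
% data log-likelihood, which a diffusion model bounds with its standard
% denoising (ELBO) loss, giving the practical form:
\max_{\theta} \;
  \mathbb{E}_{x \sim p_\theta}\!\left[ r(x) \right]
  \;-\; \beta \, \mathcal{L}_{\mathrm{diff}}(\theta; \mathcal{D})
```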
📝 Abstract
Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Existing algorithms are often vulnerable to reward hacking, manifesting as quality degradation, over-stylization, or reduced diversity. Our analysis attributes this to inherent limitations of their regularization, which yields unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
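For intuition, here is a minimal, hypothetical PyTorch sketch of a DDRL-style update that pairs a reward term on policy samples with the plain diffusion loss on offline data. The toy one-step differentiable sampler, the placeholder reward model, and names such as `beta` and `ddrl_step` are assumptions for illustration; the actual method presumably uses a proper RL estimator over the full reverse process.

```python
# Hypothetical DDRL-style training step (illustrative sketch, not the authors' code).
import torch
import torch.nn as nn


class Denoiser(nn.Module):
    """Toy epsilon-prediction network standing in for the video diffusion model."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))


def diffusion_loss(denoiser: Denoiser, x0: torch.Tensor) -> torch.Tensor:
    """Standard denoising loss on offline data (the forward-KL data anchor)."""
    t = torch.rand(x0.shape[0])
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(1 - t)[:, None] * x0 + torch.sqrt(t)[:, None] * noise  # toy schedule
    return ((denoiser(x_t, t) - noise) ** 2).mean()


def reward_fn(samples: torch.Tensor) -> torch.Tensor:
    """Placeholder reward; in practice a learned human-preference scorer."""
    return -samples.pow(2).mean(dim=-1)


def sample_from_policy(denoiser: Denoiser, n: int, dim: int = 16) -> torch.Tensor:
    """Toy one-step differentiable sampler standing in for the full reverse process."""
    x_t = torch.randn(n, dim)
    t = torch.ones(n)
    return x_t - denoiser(x_t, t)  # crude x0 estimate; real code would iterate


def ddrl_step(denoiser: Denoiser, offline_batch: torch.Tensor,
              opt: torch.optim.Optimizer, beta: float = 0.1) -> float:
    """One update combining reward maximization with offline-data regularization."""
    reward_loss = -reward_fn(sample_from_policy(denoiser, offline_batch.shape[0])).mean()
    reg_loss = diffusion_loss(denoiser, offline_batch)  # data anchor (forward KL surrogate)
    loss = reward_loss + beta * reg_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    model = Denoiser()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    data = torch.randn(32, 16)  # stand-in for encoded video latents
    print(ddrl_step(model, data, optimizer))
```

The design choice this illustrates is that the regularizer is just the ordinary diffusion training loss on offline data, so the RL update and standard diffusion training can share one optimizer and one backward pass.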