🤖 AI Summary
This work addresses a key challenge in applying reinforcement learning to diffusion models: aligning denoising generative models with human preferences or verifiable rewards when policy-gradient methods cannot be applied directly because the models' likelihoods are intractable. The authors propose Variational GRPO (V-GRPO), an online reinforcement learning method that combines an Evidence Lower Bound (ELBO) likelihood surrogate with Group Relative Policy Optimization (GRPO), gradient step control, and variance reduction for the surrogate. The approach stays consistent with the pretraining objective while improving training stability and efficiency. The work shows that an ELBO surrogate can be both efficient and stable, avoiding the inefficiency of MDP-based trajectory optimization and the weak visual-generation results of prior ELBO-based methods. On text-to-image generation, the method achieves state-of-the-art performance, with a 2× speedup over MixGRPO and a 3× speedup over DiffusionNFT.
📝 Abstract
Aligning denoising generative models with human preferences or verifiable rewards remains a key challenge. While policy-gradient online reinforcement learning (RL) offers a principled post-training framework, its direct application is hindered by the intractable likelihoods of these models. Prior work therefore either optimizes an induced Markov decision process (MDP) over sampling trajectories, which is stable but inefficient, or uses likelihood surrogates based on the diffusion evidence lower bound (ELBO), which have so far underperformed on visual generation. Our key insight is that the ELBO-based approach can, in fact, be made both stable and efficient. By reducing surrogate variance and controlling gradient steps, we show that this approach can beat MDP-based methods. To this end, we introduce Variational GRPO (V-GRPO), a method that integrates ELBO-based surrogates with the Group Relative Policy Optimization (GRPO) algorithm, alongside a set of simple yet essential techniques. Our method is easy to implement, aligns with pretraining objectives, and avoids the limitations of MDP-based methods. V-GRPO achieves state-of-the-art performance in text-to-image synthesis, while delivering a $2\times$ speedup over MixGRPO and a $3\times$ speedup over DiffusionNFT.
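To make the ingredients named in the abstract concrete, here is a minimal, illustrative sketch of how a GRPO-style update could use an ELBO-based likelihood surrogate. It is not the authors' implementation. The abstract does not specify the variance-reduction or step-control techniques, so the PPO-style ratio clipping below is an assumed stand-in for "controlling gradient steps", and the function names (`group_relative_advantages`, `elbo_log_ratio`, `clipped_objective`) are hypothetical. The sketch assumes the denoising losses of the old and new policy are evaluated on the same sample with the same timestep and noise, so that their difference approximates a log-likelihood ratio.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt.

    This is the group-relative baseline used by GRPO: no learned value function,
    only the group mean and (population) standard deviation.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def elbo_log_ratio(loss_old, loss_new):
    """Surrogate for log p_new(x) - log p_old(x).

    Assumption: the ELBO is (up to constants) the negative denoising loss, so the
    log-ratio is approximated by the drop in loss from the old to the new policy.
    """
    return loss_old - loss_new


def clipped_objective(log_ratios, advantages, clip=0.2):
    """PPO-style clipped objective (to be maximized), averaged over the group.

    Assumption: clipping is one generic way to bound the update size; the paper's
    actual step-control mechanism may differ.
    """
    terms = []
    for log_ratio, adv in zip(log_ratios, advantages):
        ratio = math.exp(log_ratio)
        clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return fmean(terms)


if __name__ == "__main__":
    # Hypothetical rewards for four images generated from one prompt.
    rewards = [1.0, 2.0, 3.0, 4.0]
    advantages = group_relative_advantages(rewards)

    # Hypothetical per-sample denoising losses under the old and new policy.
    old_losses = [0.50, 0.50, 0.50, 0.50]
    new_losses = [0.52, 0.51, 0.49, 0.47]
    log_ratios = [elbo_log_ratio(o, n) for o, n in zip(old_losses, new_losses)]

    print(clipped_objective(log_ratios, advantages))
```

In this toy setup the new policy lowers the loss on high-reward samples and raises it on low-reward ones, which is the direction the objective rewards. Note that the surrogate only needs denoising losses at sampled timesteps rather than a full sampling trajectory, which is the source of the efficiency contrast with MDP-based methods described above.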