Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

📅 2026-02-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Diffusion models pose a challenge for direct application of policy gradient-based reinforcement learning methods due to their intractable likelihood, and existing research lacks a systematic analysis of how likelihood estimation affects optimization. This work presents the first disentanglement of three key components in reinforcement learning for diffusion models: the policy gradient objective, the likelihood estimator, and the sampling strategy. The study reveals that the final-sample likelihood estimate based on the evidence lower bound (ELBO) is the dominant factor governing optimization efficacy, underscoring the centrality of likelihood estimation over reliance on loss function design. Experiments on SD 3.5 Medium demonstrate that the proposed approach improves the GenEval score from 0.24 to 0.95, achieves 4.6× higher training efficiency than FlowGRPO and 2× that of the state-of-the-art DiffusionNFT, and exhibits no reward hacking behavior.

๐Ÿ“ Abstract
Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.
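The abstract's central claim is that an ELBO-based likelihood estimate, computed only from the final generated sample, is what makes policy-gradient RL tractable for diffusion models. A rough, generic sketch of that idea is below. This is not the paper's implementation: `predict_eps`, the linear noise schedule, and the REINFORCE-style surrogate are all illustrative stand-ins for an epsilon-prediction diffusion model, assumed here for the sake of the example.

```python
import numpy as np

def elbo_logp(predict_eps, x0, n_timesteps=1000, n_mc=8, rng=None):
    """Monte Carlo ELBO-style estimate of log p(x0), up to constants,
    for an epsilon-prediction diffusion model. `predict_eps(x_t, t)` is
    a stand-in for the denoising network; all names are illustrative."""
    rng = rng or np.random.default_rng(0)
    b = x0.shape[0]
    total = np.zeros(b)
    for _ in range(n_mc):
        # Sample a random timestep per example and re-noise x0 to x_t.
        t = rng.integers(0, n_timesteps, size=b)
        alpha_bar = 1.0 - (t + 1) / n_timesteps  # toy linear schedule
        eps = rng.standard_normal(x0.shape)
        x_t = (np.sqrt(alpha_bar)[:, None] * x0
               + np.sqrt(1.0 - alpha_bar)[:, None] * eps)
        # Denoising error; the ELBO is (up to weights and constants)
        # the negative of this MSE, so smaller error => higher likelihood.
        err = ((predict_eps(x_t, t) - eps) ** 2).sum(axis=1)
        total -= err / n_mc
    return total  # per-sample surrogate for log p(x0)

def pg_surrogate(predict_eps, x0, advantages):
    """REINFORCE-style loss -E[A * log p(x0)] built on the final-sample
    ELBO estimate, with no per-step chain likelihoods."""
    return float(-(advantages * elbo_logp(predict_eps, x0)).mean())
```

The key design point the paper emphasizes is visible here: the likelihood is estimated by re-noising the final sample `x0` rather than by accumulating per-step transition likelihoods along the rollout chain.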
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Diffusion Models
Likelihood Estimation
Policy Gradient
Text-to-Image Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Likelihood Estimation
Diffusion Models
Reinforcement Learning
ELBO
Policy Gradient