DanceGRPO: Unleashing GRPO on Visual Generation

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement Learning from Human Feedback (RLHF) in visual generation faces critical challenges: incompatibility with ODE-based samplers, training instability at scale, and lack of systematic evaluation—particularly for video generation. Method: This paper introduces the first unified adaptation of Group Relative Policy Optimization (GRPO) to both diffusion models and rectified flow paradigms, spanning text-to-image, text-to-video, and image-to-video tasks. We propose a cross-paradigm, cross-task, and cross-model RLHF framework supporting sparse binary feedback learning and denoising trajectory modeling, while remaining fully compatible with ODE samplers and integrating multi-source rewards (e.g., aesthetic quality, prompt alignment, motion fidelity). Results: Evaluated on Stable Diffusion, HunyuanVideo, FLUX, and SkyReel-I2V, our approach achieves up to 181% improvement on HPS-v2.1—demonstrating, for the first time, GRPO’s effectiveness and robustness in large-scale video generation.
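At the heart of GRPO is a group-relative advantage: for each prompt, a group of samples is scored by a reward model, and each sample's advantage is its reward normalized against the group's mean and standard deviation. The sketch below is an illustrative simplification of that normalization step (the function name and example rewards are hypothetical), not the paper's implementation:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each sample's reward
    within its group, A_i = (r_i - mean(r)) / (std(r) + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards for a group of 4 samples from one prompt,
# e.g. scores from an aesthetic or prompt-alignment reward model.
adv = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because advantages are computed relative to the group rather than via a learned value function, this style of update also works with sparse binary feedback: a group containing both accepted and rejected samples still yields nonzero, well-scaled advantages.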

📝 Abstract
Recent breakthroughs in generative models (particularly diffusion models and rectified flows) have revolutionized visual content creation, yet aligning model outputs with human preferences remains a critical challenge. Existing reinforcement learning (RL)-based methods for visual generation face critical limitations: incompatibility with modern Ordinary Differential Equation (ODE)-based sampling paradigms, instability in large-scale training, and lack of validation for video generation. This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward). To our knowledge, DanceGRPO is the first RL-based unified framework capable of seamless adaptation across diverse generative paradigms, tasks, foundation models, and reward models. DanceGRPO demonstrates consistent and substantial improvements, outperforming baselines by up to 181% on benchmarks such as HPS-v2.1, CLIP Score, VideoAlign, and GenEval. Notably, DanceGRPO can not only stabilize policy optimization for complex video generation, but also enable the generative policy to better capture denoising trajectories for Best-of-N inference scaling and to learn from sparse binary feedback. Our results establish DanceGRPO as a robust and versatile solution for scaling Reinforcement Learning from Human Feedback (RLHF) tasks in visual generation, offering new insights into harmonizing reinforcement learning and visual synthesis. The code will be released.
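The Best-of-N inference scaling mentioned in the abstract follows a simple recipe: draw N candidates from the generative policy and keep the one the reward model scores highest. A minimal sketch, assuming hypothetical `generate` and `reward` callables (not the paper's API):

```python
def best_of_n(generate, reward, prompt, n=4):
    """Best-of-N inference scaling: sample n candidates for a prompt
    and return the one with the highest reward-model score."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best_idx = max(range(n), key=scores.__getitem__)
    return candidates[best_idx]
```

A policy whose denoising trajectories are better aligned with the reward model benefits more from this procedure, since the candidate pool concentrates on high-reward outputs.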
Problem

Research questions and friction points this paper is trying to address.

Aligning generative model outputs with human preferences
Overcoming RL limitations in visual generation tasks
Unifying RL across diverse paradigms and models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified RL framework for diverse visual generation tasks
Adapts GRPO across diffusion models and rectified flows
Stabilizes policy optimization for complex video generation