Flow-GRPO: Training Flow Matching Models via Online RL

📅 2025-05-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of an online reinforcement learning (RL) training framework for flow matching (FM) models. It introduces online RL into the text-to-image FM paradigm for the first time, improving semantic fidelity and controllability. Methodologically: (1) an ODE-to-SDE conversion turns the deterministic sampler into a stochastic one with matching marginal distributions, providing the sampling-based exploration RL requires; (2) a Denoising Reduction strategy cuts the number of training-time denoising steps while keeping the full inference schedule, accelerating data collection. Applied to SD3.5, the approach raises GenEval compositional accuracy from 63% to 95% and visual text rendering accuracy from 59% to 92%. Human preference studies confirm substantial improvements without reward hacking. This work establishes the first systematic framework and key technical pathway for integrating flow matching with online RL.

📝 Abstract
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
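The ODE-to-SDE conversion can be illustrated on a closed-form toy problem rather than the paper's SD3.5 setting. Below, a 1-D rectified flow between two standard Gaussians has the known marginal N(0, s_t²) with s_t² = (1 − t)² + t², and its velocity field is linear. Adding noise of scale g together with the score correction (g²/2)·∇log p_t keeps the marginals unchanged, so the stochastic sampler lands on the same terminal distribution as the deterministic one. This is a minimal sketch of the marginal-preserving idea, with my own toy velocity field, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D rectified flow: both endpoints are N(0, 1), so the marginal at
# time t is N(0, s_t^2) with s_t^2 = (1 - t)^2 + t^2, and the flow ODE is
# dx = v(x, t) dt with the linear velocity below.
def s2(t):
    return (1.0 - t) ** 2 + t ** 2

def v(x, t):
    return (2.0 * t - 1.0) / s2(t) * x

# Matched SDE: dx = [v + (g^2/2) * grad log p_t] dt + g dW, where the
# Gaussian score is grad log p_t(x) = -x / s_t^2. By Fokker-Planck, this
# SDE has the same marginals as the ODE at every t.
n, steps, g = 200_000, 1_000, 0.5
dt = 1.0 / steps

x_ode = rng.standard_normal(n)   # particles moved by the ODE
x_sde = rng.standard_normal(n)   # particles moved by the matched SDE

for k in range(steps):
    t = k * dt
    x_ode += v(x_ode, t) * dt
    drift = v(x_sde, t) + (g ** 2 / 2.0) * (-x_sde / s2(t))
    x_sde += drift * dt + g * np.sqrt(dt) * rng.standard_normal(n)

# Both samplers should land on the same terminal marginal N(0, 1),
# even though only the SDE path injects noise for RL exploration.
print(x_ode.var(), x_sde.var())
```

The SDE trajectories are stochastic, which is what gives an RL algorithm a well-defined sampling distribution to explore and assign likelihoods under, while the generated marginals stay those of the original model.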
Problem

Research questions and friction points this paper is trying to address.

Integrating online RL into flow matching models, whose deterministic ODE samplers lack the stochasticity RL exploration requires
Keeping training-time sampling affordable without degrading inference quality
Enhancing text-to-image compositional accuracy, text rendering, and human preference alignment
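The "GRPO" in the method's name refers to group-relative policy optimization, where for each prompt a group of samples is scored by a reward model and advantages are computed by normalizing rewards within the group. A minimal sketch of that advantage computation, with illustrative reward values:

```python
import numpy as np

# GRPO-style group-relative advantages: rewards for a group of samples
# generated from the same prompt are standardized within the group, so no
# learned value function is needed. Values here are made up for the demo.
def group_advantages(rewards, eps=1e-8):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = group_advantages([0.2, 0.8, 0.5, 0.5])
print(adv)  # zero-mean; above-average samples get positive advantage
```

Each sample's advantage then weights its (clipped) policy-gradient term, so images the reward model prefers are reinforced relative to their group.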
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates online RL into flow matching models
Uses ODE-to-SDE conversion for RL exploration
Employs Denoising Reduction for efficient training
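Denoising Reduction amounts to collecting RL rollouts on a short denoising schedule while inference keeps the full one; the abstract reports no performance degradation from this mismatch. A hypothetical sketch (the function and step counts below are illustrative, not the paper's code):

```python
# Hypothetical Denoising Reduction setup: cheap short-schedule rollouts
# for RL data collection, full schedule at inference time.
def make_schedule(num_steps):
    # uniform rectified-flow timesteps on [0, 1)
    return [i / num_steps for i in range(num_steps)]

TRAIN_STEPS, INFER_STEPS = 10, 40      # illustrative step counts

train_ts = make_schedule(TRAIN_STEPS)  # used only when gathering RL data
infer_ts = make_schedule(INFER_STEPS)  # inference schedule is unchanged
```

Because rollouts cost a forward pass per denoising step, cutting the rollout schedule from 40 to 10 steps would shrink sampling cost roughly fourfold, which is where the claimed training speedup comes from.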