🤖 AI Summary
Flow matching models suffer from low sampling efficiency in RL training, because fixed per-prompt group sizes ignore differences in sampling importance across prompts, and from biased credit assignment, because trajectory-level advantages are reused across all timesteps. To address these issues, the authors propose SuperFlow, which combines dynamic group sampling with step-level advantage estimation consistent with continuous-time flow dynamics. Key contributions: (1) a variance-aware dynamic group-size adaptation mechanism that improves importance-sampling efficiency; and (2) step-level advantage computation grounded in continuous-time flow dynamics, which mitigates credit misallocation along the flow. The method integrates flow matching, GRPO-style policy optimization, adaptive importance sampling, and continuous-time advantage estimation. Experiments show SuperFlow needs only 5.4%–56.3% of the original training steps and 5.2%–16.7% less wall-clock time. On text-to-image generation, it outperforms SD3.5-M by 4.6%–47.2% and Flow-GRPO by 1.7%–16.0% on sample-quality metrics.
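The variance-aware group-size adaptation can be pictured as a budget-allocation step: prompts whose rewards vary more across samples get larger groups. The sketch below is only an illustration of that intuition under assumed details (proportional-to-variance allocation, a per-prompt minimum, variances from a hypothetical running estimate over earlier rollouts); the paper's actual rule is not specified in the summary.

```python
import numpy as np

def allocate_group_sizes(reward_variances, total_budget, min_size=2):
    """Illustrative variance-aware group-size allocation.

    Reserves `min_size` samples per prompt, then splits the remaining
    sampling budget in proportion to each prompt's reward variance,
    so high-variance prompts (more learning signal per sample) get
    larger groups. This is a sketch, not the paper's exact mechanism.
    """
    v = np.asarray(reward_variances, dtype=float)
    n = len(v)
    spare = total_budget - min_size * n
    shares = v / v.sum() if v.sum() > 0 else np.full(n, 1.0 / n)
    sizes = min_size + np.floor(spare * shares).astype(int)
    # Hand rounding leftovers to the highest-variance prompts first.
    for i in np.argsort(-v)[: total_budget - sizes.sum()]:
        sizes[i] += 1
    return sizes
```

With a fixed GRPO-style scheme, every prompt would get `total_budget / n` samples regardless of how informative its reward distribution is; the allocation above spends the same budget unevenly.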
📝 Abstract
Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow matches or exceeds baseline performance while using only 5.4% to 56.3% of the original training steps, and it reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
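Problem (ii) is easiest to see in code: standard GRPO computes one group-normalized advantage per trajectory and broadcasts that scalar to every step. The sketch below shows that baseline; the `step_weights` argument is only a hypothetical stand-in for a per-step schedule, not SuperFlow's actual continuous-time estimator.

```python
import numpy as np

def grpo_step_advantages(group_rewards, num_steps, step_weights=None):
    """GRPO-style group-normalized advantages, broadcast over steps.

    Each trajectory's reward is normalized against its group (samples
    for the same prompt); with `step_weights=None` the resulting scalar
    is reused unchanged at every step, which is exactly the uniform
    credit assignment the paper argues is biased for flow dynamics.
    """
    r = np.asarray(group_rewards, dtype=float)
    a = (r - r.mean()) / (r.std() + 1e-8)   # group-relative advantage
    if step_weights is None:
        step_weights = np.ones(num_steps)   # trajectory-level reuse
    # Shape: (group_size, num_steps) — one advantage per (sample, step).
    return np.outer(a, step_weights)
```

Under the uniform broadcast, early and late denoising steps receive identical credit even though they move the sample very differently along the flow; a step-level estimator replaces the constant row with values that vary in t.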