Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high training cost and instability of aligning video diffusion models with human preferences. Existing efficiency methods that subsample training timesteps cut cost but degrade the optimization itself. The authors propose Flash-GRPO, a single-step policy optimization framework with two components. Iso-temporal grouping enforces temporal consistency within each prompt, which eliminates timestep-confounded variance. Temporal gradient rectification mitigates the inconsistent gradient magnitudes that arise across timesteps. Evaluated on models from 1.3B to 14B parameters, Flash-GRPO improves both training efficiency and stability, and under limited computational budgets it achieves better alignment than full-trajectory training, with state-of-the-art results.

📝 Abstract

Group Relative Policy Optimization (GRPO) has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs by subsampling training timesteps with a sliding window, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full-trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full-trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
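The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so every function name, the uniform timestep draw, and the `scale_fn` used for rectification are assumptions made for illustration. The sketch shows (1) drawing one training timestep per prompt and sharing it across that prompt's group of rollouts, so rewards within a group are compared at equal timestep difficulty, (2) the usual group-relative advantage normalization, and (3) dividing out a hypothetical time-dependent gradient scale.

```python
import random
from statistics import mean, pstdev


def sample_isotemporal_timesteps(prompts, group_size, num_steps, rng):
    """Draw ONE timestep per prompt and share it across that prompt's group.

    Assumption: timesteps are drawn uniformly from range(num_steps).
    """
    plan = {}
    for prompt in prompts:
        t = rng.randrange(num_steps)      # single draw per prompt
        plan[prompt] = [t] * group_size   # every rollout in the group uses t
    return plan


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one prompt's group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rectified_weight(t, scale_fn):
    """Hypothetical rectification: cancel a time-dependent gradient scale.

    scale_fn is a placeholder; the actual scaling factor is not given
    in the abstract.
    """
    return 1.0 / scale_fn(t)


rng = random.Random(0)
plan = sample_isotemporal_timesteps(
    ["a cat surfing", "a city at night"], group_size=4, num_steps=50, rng=rng
)
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
weight = rectified_weight(10, lambda t: 1.0 + t / 10.0)  # made-up scale
```

Because each group shares one timestep, differences in reward inside a group reflect the sampled rollouts rather than which timestep happened to be trained, which is the confound the abstract says iso-temporal grouping removes.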

Problem

Research questions and friction points this paper is trying to address.

video diffusion

policy optimization

computational bottleneck

training efficiency

human preference alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flash-GRPO

iso-temporal grouping

temporal gradient rectification