OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

📅 2026-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low sample efficiency of Group Relative Policy Optimization (GRPO) in flow-matching models, which stems from its on-policy training paradigm. The paper proposes the first off-policy GRPO framework tailored for flow matching, incorporating a trajectory replay buffer and a high-quality trajectory selection strategy. To reconcile off-policy data with GRPO's clipping mechanism, the authors design a sequence-level importance sampling scheme that mitigates the pathological off-policy ratios prevalent in later denoising steps. Empirical results on image and video generation tasks show that the proposed method achieves comparable or superior generation quality using only about 34.2% of the training steps required by Flow-GRPO, substantially improving training efficiency.
📝 Abstract
Post training via GRPO has demonstrated remarkable effectiveness in improving the generation quality of flow-matching models. However, GRPO suffers from inherently low sample efficiency due to its on-policy training paradigm. To address this limitation, we present OP-GRPO, the first Off-Policy GRPO framework tailored for flow-matching models. First, we actively select high-quality trajectories and adaptively incorporate them into a replay buffer for reuse in subsequent training iterations. Second, to mitigate the distribution shift introduced by off-policy samples, we propose a sequence-level importance sampling correction that preserves the integrity of GRPO's clipping mechanism while ensuring stable policy updates. Third, we theoretically and empirically show that late denoising steps yield ill-conditioned off-policy ratios, and mitigate this by truncating trajectories at late steps. Across image and video generation benchmarks, OP-GRPO achieves comparable or superior performance to Flow-GRPO with only 34.2% of the training steps on average, yielding substantial gains in training efficiency while maintaining generation quality.
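The abstract's second and third contributions, a sequence-level importance ratio combined with GRPO's clipped surrogate, and truncation of late denoising steps where the ratio becomes ill-conditioned, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the truncation interface, and the clipping constant are assumptions for illustration only.

```python
import math

def sequence_level_ratio(logp_new, logp_old, truncate_at=None):
    # Sequence-level importance ratio over denoising steps:
    # exp(sum of per-step log-probs under the current policy minus
    # the behavior policy that generated the replayed trajectory).
    # `truncate_at` (hypothetical knob) drops late denoising steps,
    # where the paper reports the off-policy ratio is ill-conditioned.
    if truncate_at is not None:
        logp_new = logp_new[:truncate_at]
        logp_old = logp_old[:truncate_at]
    return math.exp(sum(logp_new) - sum(logp_old))

def clipped_objective(ratio, advantage, eps=0.2):
    # PPO/GRPO-style clipped surrogate, applied once per trajectory
    # using the sequence-level ratio rather than per-step ratios.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)
```

A single clip on the whole-trajectory ratio keeps GRPO's trust-region behavior intact while still allowing replayed (off-policy) trajectories to contribute to the update; per-step clipping would instead compound small per-step deviations across the denoising chain.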
Problem

Research questions and friction points this paper is trying to address.

GRPO
flow-matching models
sample efficiency
on-policy training
off-policy learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-Policy Learning
Flow-Matching Models
GRPO
Replay Buffer
Importance Sampling
Liyu Zhang
College of Control Science and Engineering, Zhejiang University
Kehan Li
Stanford University
Tingrui Han
Central Research Institute, Huawei
Tao Zhao
Meta
Distributed Systems, Intent-Based Networking
Yuxuan Sheng
College of Control Science and Engineering, Zhejiang University
Shibo He
Professor, College of Control Science and Engineering, Zhejiang University
Internet of Things, Big Data, Network Science
Chao Li
Zhejiang University
in-memory computing, non-volatile memory, hardware acceleration