🤖 AI Summary
GRPO faces a fundamental trade-off between large-group sampling and high computational cost in generative model alignment. This work identifies the reward clustering phenomenon and proposes Pro-GRPO, a dynamic trajectory-pruning framework. Methodologically, it introduces an "Expand-and-Prune" strategy: first expanding the initial sample set to enhance trajectory diversity, then applying Optimal Variance Filtering (OVF), a stepwise pruning mechanism based on latent feature representations, to selectively retain high-variance trajectories. Crucially, pruning is embedded directly into the sampling process, enabling early termination and adaptive group-size control. The framework is compatible with both diffusion-based and flow-based generative models. Experiments demonstrate that, under fixed compute budgets, Pro-GRPO reduces reward variance by 37%, accelerates convergence by 2.1×, and generalizes effectively to multi-task instruction tuning. Overall, it achieves a Pareto improvement, simultaneously reducing computational overhead and enhancing alignment performance.
📝 Abstract
Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate this trade-off through empirical studies and make two key observations. First, we discover a reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF) and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still incurs substantial computational overhead, as it fully samples trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. By terminating reward-clustered trajectories early, Pro-GRPO reduces computational overhead. Leveraging this efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy: it first expands the size of the initial sampling group to maximize trajectory diversity, then applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
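The post-sampling form of OVF described in the abstract can be illustrated with a minimal sketch: given a group of sampled trajectories and their rewards, drop the trajectories whose rewards cluster near the group mean and keep the high-variance subset. The function name, the `keep_ratio` parameter, and the absolute-deviation selection rule here are illustrative assumptions for exposition, not the paper's exact criterion (which operates on latent feature representations during sampling).

```python
import numpy as np

def optimal_variance_filter(rewards, keep_ratio=0.5):
    """Hypothetical sketch of post-sampling OVF.

    Keeps the subset of trajectories whose rewards deviate most from the
    group mean (the 'high-variance' subset); trajectories exhibiting
    reward clustering near the mean are discarded.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Keep at least 2 trajectories so a relative advantage is still defined.
    k = max(2, int(len(rewards) * keep_ratio))
    deviations = np.abs(rewards - rewards.mean())
    keep = np.argsort(deviations)[-k:]  # indices of the k most deviant rewards
    return np.sort(keep)

# Example: rewards clustered around 0.5, plus two outliers at indices 3 and 4.
group_rewards = [0.48, 0.50, 0.52, 0.10, 0.90, 0.51]
kept = optimal_variance_filter(group_rewards, keep_ratio=0.34)
```

Under this toy rule, the two outlier trajectories (indices 3 and 4) survive the filter while the reward-clustered ones are pruned, mirroring the abstract's observation that near-mean trajectories contribute little optimization signal.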