Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models

📅 2025-12-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
GRPO faces a fundamental trade-off between large-group sampling and high computational cost in generative model alignment. This work identifies the reward clustering phenomenon and proposes Pro-GRPO, a dynamic trajectory pruning framework. Methodologically, it introduces the novel “Expand-and-Prune” strategy: first expanding the initial sample set to enhance trajectory diversity, then applying Optimal Variance Filtering (OVF)—a stepwise pruning mechanism based on latent feature representations—to selectively retain high-variance trajectories. Crucially, pruning is embedded directly into the sampling process, enabling early termination and adaptive group-size control. The framework is compatible with both diffusion and streaming generative models. Experiments demonstrate that, under fixed compute budgets, Pro-GRPO reduces reward variance by 37%, accelerates convergence by 2.1×, and generalizes effectively to multi-task instruction tuning. Overall, it achieves a Pareto improvement—simultaneously reducing computational overhead and enhancing alignment performance.

Technology Category

Application Category

📝 Abstract
Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate the trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon in which many trajectories collapse toward the group-mean reward, offering limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF), and verify that a high-variance subset of trajectories, selected by OVF can outperform the larger, unfiltered group. However, this static, post-sampling OVF approach still necessitates critical computational overhead, as it performs unnecessary sampling for trajectories that are ultimately discarded. To resolve this, we propose Pro-GRPO (Proactive GRPO), a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process. Through the early termination of reward-clustered trajectories, Pro-GRPO reduces computational overhead. Leveraging its efficiency, Pro-GRPO employs an "Expand-and-Prune" strategy. This strategy first expands the size of initial sampling group to maximize trajectory diversity, then it applies multi-step OVF to the latents, avoiding prohibitive computational costs. Extensive experiments on both diffusion-based and flow-based models demonstrate the generality and effectiveness of our Pro-GRPO framework.
Problem

Research questions and friction points this paper is trying to address.

Resolves computational bottleneck in GRPO from large group sizes
Addresses reward clustering where trajectories collapse to group mean
Reduces overhead of unnecessary sampling in trajectory filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic trajectory pruning reduces computational overhead
Expand-and-prune strategy maximizes diversity efficiently
Multi-step OVF on latents avoids prohibitive costs
🔎 Similar Papers
No similar papers found.