🤖 AI Summary
GRPO suffers from significant computational redundancy in long-context training because the shared input prefix is re-encoded for every group member, which severely limits scalability. To address this, the paper proposes Prefix Grouper, an efficient GRPO training algorithm built on a Shared-Prefix Forward strategy that eliminates this redundant prefix computation end to end. The method restructures self-attention to decouple the shared-prefix and independent-suffix computation, so the prefix is encoded only once and its activations are reused across the group, while preserving full gradient flow. The authors provide theoretical and empirical evidence of exact training equivalence to standard GRPO: forward outputs and backward gradients are identical, so the optimization trajectory and final policy performance are unchanged. As a plug-and-play module, Prefix Grouper is compatible with existing GRPO-based architectures and enables larger group sizes and longer contexts under the same compute budget. Experiments demonstrate substantial reductions in training cost, especially with long shared prefixes. The implementation is publicly available.
📝 Abstract
Group Relative Policy Optimization (GRPO) enhances policy learning by computing gradients from relative comparisons among candidate outputs that share a common input prefix. Despite its effectiveness, GRPO introduces substantial computational overhead when processing long shared prefixes, which must be redundantly encoded for each group member. This inefficiency becomes a major scalability bottleneck in long-context learning scenarios. We propose Prefix Grouper, an efficient GRPO training algorithm that eliminates redundant prefix computation via a Shared-Prefix Forward strategy. In particular, by restructuring self-attention into two parts, our method enables the shared prefix to be encoded only once, while preserving full differentiability and compatibility with end-to-end training. We provide both theoretical and empirical evidence that Prefix Grouper is training-equivalent to standard GRPO: it yields identical forward outputs and backward gradients, ensuring that the optimization dynamics and final policy performance remain unchanged. Empirically, our experiments confirm that Prefix Grouper matches standard GRPO's results while significantly reducing the computational cost of training, particularly in long-prefix scenarios. The proposed method is fully plug-and-play: it is compatible with existing GRPO-based architectures and can be seamlessly integrated into current training pipelines as a drop-in replacement, requiring no structural modifications and only minimal changes to input construction and attention computation. Prefix Grouper enables the use of larger group sizes under the same computational budget, thereby improving the scalability of GRPO to more complex tasks and larger models. Code is available at https://github.com/johncaged/PrefixGrouper.
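The equivalence claim can be illustrated with a minimal toy model. The sketch below is not the paper's implementation: it uses single-head attention over raw numpy embeddings (no projections, no multiple layers), and the names `causal_attn`, `baseline`, and `shared` are illustrative. It shows the core restructuring idea: suffix queries attending to cached prefix keys/values plus their own causal suffix keys produce exactly the same outputs as re-running full attention over each concatenated (prefix + suffix) sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attn(q, k, v):
    """Single-head causal attention. Query i sits at absolute position
    len(k) - len(q) + i, so it may attend to keys 0..that position."""
    Tq, Tk, d = q.shape[0], k.shape[0], q.shape[1]
    scores = (q @ k.T) / np.sqrt(d)
    pos = Tk - Tq + np.arange(Tq)[:, None]   # absolute query positions
    mask = np.arange(Tk)[None, :] > pos      # mask out future keys
    scores = np.where(mask, -1e30, scores)
    return softmax(scores) @ v

# Toy sizes (hypothetical): embedding dim, prefix len, suffix len, group size.
d, P, S, G = 8, 6, 3, 4
prefix = rng.normal(size=(P, d))
suffixes = [rng.normal(size=(S, d)) for _ in range(G)]

# Standard GRPO forward: re-encode the full (prefix + suffix) sequence
# for every group member, then keep only the suffix outputs.
baseline = [
    causal_attn(np.concatenate([prefix, s]),
                np.concatenate([prefix, s]),
                np.concatenate([prefix, s]))[P:]
    for s in suffixes
]

# Shared-prefix forward: the prefix keys/values are computed once and
# reused for all G members; each member only issues its own suffix
# queries against the [prefix ; suffix] key/value sequence.
prefix_kv = prefix                            # cached once per group
shared = [
    causal_attn(s,
                np.concatenate([prefix_kv, s]),
                np.concatenate([prefix_kv, s]))
    for s in suffixes
]

for b, s in zip(baseline, shared):
    assert np.allclose(b, s)                  # identical forward outputs
```

In this toy the prefix "encoding" is trivial, but the masking argument is the same one the real method relies on: a suffix query at absolute position P+i sees exactly keys 0..P+i in both formulations, so the attention scores, and hence outputs and gradients, coincide while the prefix is processed only once.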