CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

📅 2025-03-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the high computational overhead in GRPO-based reasoning model training caused by multi-completion sampling. The authors propose Completion Pruning Policy Optimization (CPPO), a novel framework that improves training efficiency without sacrificing inference accuracy. Methodologically, CPPO introduces (1) an absolute advantage threshold–based completion pruning mechanism to eliminate low-quality candidate sequences during sampling, and (2) a dynamic “fill-in-the-blank” batch scheduling strategy coupled with gradient sparsity-aware updates to enhance GPU utilization and throughput. Evaluated on the GSM8K and MATH benchmarks, CPPO achieves 8.32× and 3.51× training speedups, respectively, while matching or exceeding the original GRPO’s inference accuracy. These results demonstrate CPPO’s effectiveness in alleviating the efficiency bottleneck inherent in large language model reasoning alignment training.

📝 Abstract
This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs due to the need for sampling multiple completions for each question. Our experimental and theoretical analysis reveals that the number of completions impacts model accuracy yet increases training time multiplicatively, and not all completions contribute equally to policy training -- their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on MATH while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
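The core pruning idea in the abstract — keep only completions whose group-relative advantage has large absolute value — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the `keep_ratio` knob and reward values are hypothetical stand-ins for whatever threshold rule CPPO actually uses.

```python
import numpy as np

def prune_completions(rewards, keep_ratio=0.5):
    """Keep the completions with the largest |group-relative advantage|.

    Sketch of CPPO-style pruning: advantages are computed GRPO-style by
    normalizing rewards within the group, then low-|advantage| completions
    are dropped before gradient computation. `keep_ratio` is a hypothetical
    knob, not the paper's exact criterion.
    """
    rewards = np.asarray(rewards, dtype=float)
    # GRPO-style group-normalized advantage
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    k = max(1, int(len(rewards) * keep_ratio))
    # indices of the k completions with the highest absolute advantage
    keep = np.argsort(-np.abs(adv))[:k]
    return sorted(keep.tolist()), adv

# Example: binary-ish correctness rewards for 6 sampled completions
rewards = [1.0, 0.0, 0.0, 1.0, 0.5, 0.0]
keep, adv = prune_completions(rewards, keep_ratio=0.5)
```

Only the retained indices would enter the policy-gradient update, which is where the claimed speedup comes from: fewer completions means fewer forward/backward passes per question.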
Problem

Research questions and friction points this paper is trying to address.

Accelerate training of GRPO-based reasoning models
Reduce high training costs from multiple completions
Optimize GPU utilization with dynamic allocation strategy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prunes low-advantage completions to reduce costs
Uses dynamic completion allocation for GPU efficiency
Achieves significant speedup while maintaining accuracy
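The dynamic completion allocation idea above — refilling batch slots freed by pruning with completions from additional questions so the GPU stays fully utilized — could look roughly like this. All names here are hypothetical; the sketch only illustrates the packing logic, not the paper's scheduler.

```python
def allocate_batch(pruned_groups, reserve_questions, device_capacity):
    """Pack retained completions into a fixed-size device batch.

    Illustrative sketch of dynamic completion allocation: after pruning,
    each question's group contributes fewer completions, so slots freed
    up to `device_capacity` are filled with completions drawn from
    additional ("reserve") questions.
    """
    # Start with everything that survived pruning.
    batch = [c for group in pruned_groups for c in group]
    # Fill remaining slots from reserve questions until the batch is full.
    for extra_group in reserve_questions:
        for completion in extra_group:
            if len(batch) >= device_capacity:
                return batch
            batch.append(completion)
    return batch
```

For example, if pruning leaves 3 completions but the device fits 5, two completions from a reserve question are pulled in, so every step processes a full batch instead of a padded one.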
Zhihang Lin
Xiamen University & Shanghai Innovation Institute
Efficient Artificial Intelligence
Mingbao Lin
Principal Research Scientist, Rakuten
Model Compression, (Multimodal) LLMs, Diffusion Models
Yuan Xie
Shanghai Innovation Institute, China; East China Normal University, Shanghai, China
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China