BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses

📅 2026-05-27
🤖 AI Summary
This work addresses two problems with Group Relative Policy Optimization (GRPO) for training reasoning models: high computational cost and a tendency to reinforce verbose reasoning trajectories. A gradient-similarity analysis shows that, within the same prompt group, completions of the same class (both correct or both incorrect) often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Based on this, the authors propose Binary Prefix Policy Optimization (BPPO), which updates on only the shortest correct and the shortest incorrect completion while still computing advantages with full-group normalization. Adaptive completion scheduling and prefix-focused optimization further improve efficiency; updating only response prefixes avoids reinforcing redundant suffixes. On GSM8K, MATH, and Geo3K, BPPO trains up to 6.08× faster than GRPO with competitive accuracy and reduces mean response length by approximately 30–50% without an explicit length penalty in the reward.
📝 Abstract
Group Relative Policy Optimization (GRPO) is widely used for training reasoning models, but updating all sampled completions in each group incurs substantial cost and can reinforce verbose reasoning trajectories. In this paper, we study whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Motivated by this observation, we propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit while preserving full-group advantage normalization. BPPO further improves efficiency with adaptive completion scheduling and prefix-focused optimization; by updating only response prefixes, it avoids reinforcing redundant suffixes and encourages more concise responses. Experiments on GSM8K, MATH, and Geo3K show that BPPO achieves up to 6.08x speedup over GRPO while maintaining competitive accuracy, and reduces mean response length by approximately 30-50% without modifying the reward with an explicit length penalty.
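The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 0.5 correctness threshold, the use of the population standard deviation, the choice to skip groups that contain only one class, and the fixed `prefix_len` are all assumptions made for illustration; the abstract does not specify how the prefix length or the adaptive schedule is chosen.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards over the FULL group.

    Assumption: population standard deviation plus a small eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_binary_pair(lengths, rewards, correct_threshold=0.5):
    """Pick the shortest correct and the shortest incorrect completion.

    Returns (index_correct, index_incorrect), or None when the group
    contains only one class (an assumed handling, not from the paper).
    """
    correct = [i for i, r in enumerate(rewards) if r > correct_threshold]
    incorrect = [i for i, r in enumerate(rewards) if r <= correct_threshold]
    if not correct or not incorrect:
        return None
    i_pos = min(correct, key=lambda i: lengths[i])
    i_neg = min(incorrect, key=lambda i: lengths[i])
    return i_pos, i_neg


def prefix_mask(length, prefix_len):
    """Token-level loss mask that keeps only the first prefix_len tokens.

    prefix_len is a hypothetical hyperparameter for this sketch.
    """
    return [1 if t < prefix_len else 0 for t in range(length)]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]   # binary correctness rewards for one group
    lengths = [120, 80, 60, 200]     # completion lengths in tokens

    adv = group_advantages(rewards)            # computed over all four completions
    pair = select_binary_pair(lengths, rewards)
    if pair is not None:
        i_pos, i_neg = pair
        # Only these two completions would contribute to the policy update,
        # each weighted by its full-group advantage and masked to a prefix.
        print(i_pos, adv[i_pos], prefix_mask(lengths[i_pos], 32)[:5])
        print(i_neg, adv[i_neg], prefix_mask(lengths[i_neg], 32)[:5])
```

The point of the sketch is that the advantages are computed before the selection, so the two retained completions keep the weights they would have had under full-group normalization even though the other completions are dropped from the update.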
Problem

Research questions and friction points this paper is trying to address.

GRPO
reasoning RL
efficient training
concise responses
policy optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary Prefix Policy Optimization
GRPO-style reasoning RL
gradient similarity analysis
prefix-focused optimization
concise response generation