Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the granularity imbalance in credit assignment for large language model reinforcement learning—where token-level methods (e.g., PPO) suffer from inaccurate advantage estimation due to critic training instability, and trajectory-level methods (e.g., GRPO) lack precision by relying solely on terminal rewards—this paper proposes Segment Policy Optimization (SPO). SPO introduces a critic-free, Monte Carlo–based advantage estimation at the *semantic segment* level: it employs cutpoint-driven chained segmentation and tree-structured advantage propagation over reasoning paths to achieve dynamic, structure-aware segment partitioning; policy updates are then performed segment-wise via probability masking. On GSM8K, SPO outperforms PPO and GRPO by 6–12 percentage points; on MATH500 (under 2K and 4K context evaluation), it surpasses GRPO by 7–11 points. The implementation is publicly available.
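The core idea—critic-free advantage estimation at segment boundaries—can be sketched in a few lines. This is an illustrative reading, not the paper's implementation: `policy` is a hypothetical callable that samples a completion from a prefix and returns its terminal reward, and the advantage of segment *k* is taken as the difference of Monte Carlo value estimates before and after that segment.

```python
def mc_value(policy, prefix, num_rollouts=8):
    """Estimate V(prefix) by Monte Carlo: sample completions from the
    prefix and average their terminal rewards (e.g. 0/1 correctness).
    `policy` is a hypothetical callable: prefix -> (completion, reward)."""
    rewards = [policy(prefix)[1] for _ in range(num_rollouts)]
    return sum(rewards) / num_rollouts

def segment_advantages(policy, segments, num_rollouts=8):
    """Critic-free segment advantages (sketch):
    A_k = V(s_1..s_k) - V(s_1..s_{k-1}),
    where `segments` is the list of text segments of one sampled response."""
    advantages = []
    prev_v = mc_value(policy, "", num_rollouts)  # value of the bare prompt
    prefix = ""
    for seg in segments:
        prefix += seg
        v = mc_value(policy, prefix, num_rollouts)
        advantages.append(v - prev_v)  # credit attributed to this segment
        prev_v = v
    return advantages
```

Because values are estimated only at segment boundaries rather than at every token, far fewer MC rollouts are needed than a token-level estimator would require; the tree-structured variant (SPO-tree) further shares rollouts across branches to cut this cost.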

📝 Abstract
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) rely solely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
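The abstract's "cutpoint-based partition" for SPO-chain can be illustrated with a small sketch. One plausible reading (an assumption, not the paper's exact rule) is that segment boundaries are placed after low-probability tokens—positions where the policy was uncertain and credit assignment matters most. The threshold and cap below are illustrative hyperparameters.

```python
def cutpoint_partition(tokens, token_probs, prob_threshold=0.85, max_cutpoints=5):
    """Sketch of cutpoint-based segment partition: end a segment after each
    token whose sampling probability fell below `prob_threshold`, capping
    the number of cutpoints so segment count stays small and MC value
    estimation stays cheap."""
    cut_indices = [i for i, p in enumerate(token_probs) if p < prob_threshold]
    cut_indices = cut_indices[:max_cutpoints]     # cap the number of segments
    segments, start = [], 0
    for i in cut_indices:
        segments.append(tokens[start:i + 1])      # segment ends at the cutpoint
        start = i + 1
    if start < len(tokens):
        segments.append(tokens[start:])           # trailing segment, if any
    return segments
```

Partitioning at uncertain tokens makes the segmentation dynamic and structure-aware: confident stretches of reasoning are left intact, while estimation points concentrate where the trajectory could have diverged.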
Problem

Research questions and friction points this paper is trying to address.

Improves credit assignment in RL for large language models
Balances granularity between token-level and trajectory-level methods
Enhances reasoning via segment-level advantage estimation without critic model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Segment-level advantage estimation for RL
Flexible segment partition strategy
Novel probability-mask policy optimization
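The probability-mask idea in the list above can be sketched as a PPO-style objective where every token inherits its segment's advantage and a mask restricts which tokens contribute gradient. This is a hedged illustration under assumed semantics (mask out high-confidence tokens, keep uncertain ones), not the paper's exact loss; all names and arguments here are illustrative.

```python
import math

def masked_segment_policy_loss(logprobs, old_logprobs, seg_advantages,
                               seg_ids, update_mask, clip_eps=0.2):
    """Illustrative clipped policy-gradient loss with segment advantages.
    Each token i belongs to segment seg_ids[i] and inherits that segment's
    advantage; update_mask[i] = 0 excludes token i from the update,
    concentrating gradient on masked-in (e.g. uncertain) positions."""
    total, count = 0.0, 0
    for lp, olp, sid, m in zip(logprobs, old_logprobs, seg_ids, update_mask):
        if not m:
            continue                               # token masked out
        adv = seg_advantages[sid]                  # broadcast segment -> token
        ratio = math.exp(lp - olp)                 # importance ratio
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        total += min(ratio * adv, clipped * adv)   # PPO-style clipped surrogate
        count += 1
    return -total / max(count, 1)                  # mean over unmasked tokens
```

Compared with trajectory-level updates, the segment advantage gives different parts of one response different credit; the mask then decides which tokens actually carry that credit into the gradient.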