SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

📅 2026-04-10

🤖 AI Summary
Standard token-level PPO struggles in long-horizon reasoning tasks due to unstable temporal credit assignment and the high memory overhead of value models, while critic-free approaches alleviate these issues at the cost of substantial computational expense. This work proposes Sequence-level PPO (SPPO), which formulates reasoning as a sequence-level contextual bandit problem and employs a decoupled scalar value function to produce low-variance advantage estimates. SPPO achieves both sample efficiency and update stability without requiring multi-sample baselines. Experimental results demonstrate that SPPO significantly outperforms standard PPO on mathematical reasoning benchmarks, matching the performance of computationally intensive group-based policy methods while substantially improving training throughput and resource efficiency.

📝 Abstract
Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without multi-sampling. Extensive experiments on mathematical benchmarks demonstrate that SPPO significantly surpasses standard PPO and matches the performance of computation-heavy group-based methods, offering a resource-efficient framework for aligning reasoning LLMs.
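
The abstract describes the method only in words, so the following is a minimal sketch of what a sequence-level update with a scalar baseline could look like, not the paper's actual objective. It assumes a single outcome reward per response, a scalar value estimate for the prompt, and a PPO clipped surrogate applied once per sequence using the ratio of whole-sequence probabilities. The function names, the form of the ratio, the squared-error value loss, and the clip range of 0.2 are all illustrative assumptions.

```python
import math


def sequence_advantage(reward, value_estimate):
    """One advantage for the whole response: outcome reward minus a scalar baseline.

    Assumption: the value function outputs a single number per prompt,
    so no group of sampled responses is needed to form the baseline.
    """
    return reward - value_estimate


def clipped_sequence_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate applied once per sequence (hypothetical form).

    logp_new / logp_old are per-token log-probabilities of the sampled
    response under the current and the behavior policy.
    """
    # Sequence-level ratio: exp of the summed token log-prob differences.
    ratio = math.exp(sum(logp_new) - sum(logp_old))
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Negated because optimizers minimize while PPO maximizes the surrogate.
    return -min(ratio * advantage, clipped * advantage)


def value_loss(value_estimate, reward):
    """Squared-error regression of the scalar baseline onto the outcome reward."""
    return (value_estimate - reward) ** 2


if __name__ == "__main__":
    adv = sequence_advantage(reward=1.0, value_estimate=0.25)
    print(clipped_sequence_loss([-1.0, -1.0], [-1.0, -1.0], adv))
    print(value_loss(0.25, 1.0))
```

In this sketch every token of a response shares the same advantage, which is what treating the response as a single bandit action implies; a group-based method such as GRPO would instead estimate the baseline from several sampled responses to the same prompt.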

Problem

Research questions and friction points this paper is trying to address.

Long-Horizon Reasoning
Proximal Policy Optimization
Chain-of-Thought
Credit Assignment
Sample Efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequence-Level PPO
Contextual Bandit
Low-Variance Advantage
Long-Horizon Reasoning
Sample Efficiency