Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the coarse credit assignment problem in existing reinforcement learning approaches for mathematical reasoning: reliance on final-answer rewards fails to distinguish effective reasoning steps from redundant ones, often leading to correct answers being overturned or to excessively verbose responses. To overcome this, the authors propose SPAE (Step Potential-based Advantage Estimation), a training-free probing mechanism that, for the first time, estimates the semantic quality of each reasoning step by integrating intermediate confidence and correctness into a Step Potential signal. This enables fine-grained credit assignment and timely termination control. Experiments demonstrate that SPAE significantly improves accuracy across multiple mathematical reasoning benchmarks while substantially reducing response length, outperforming both reinforcement learning and efficient reasoning baselines.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs), but outcome-based rewards lead to coarse-grained advantage estimation. While existing approaches improve RLVR via token-level entropy or sequence-level length control, they lack a semantically grounded, step-level measure of reasoning progress. As a result, LLMs fail to distinguish necessary deduction from redundant verification: they may continue checking after reaching a correct solution and, in extreme cases, overturn a correct trajectory into an incorrect final answer. To remedy the lack of process supervision, we introduce a training-free probing mechanism that extracts intermediate confidence and correctness and combines them into a Step Potential signal that explicitly estimates the reasoning state at each step. Building on this signal, we propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies a penalty after potential saturates to encourage timely termination. Experiments across multiple benchmarks show SPAE consistently improves accuracy while substantially reducing response length, outperforming strong RL baselines and recent efficient reasoning and token-level advantage estimation methods. The code is available at https://github.com/cii030/SPAE-RL.
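The mechanics described in the abstract can be sketched in a few lines. The page does not give the paper's actual formulas, so everything below is an assumption for illustration: the multiplicative combination of confidence and correctness, the gain/drop weights, and the saturation rule are all hypothetical choices, not the authors' method.

```python
def step_potential(confidence, correctness):
    """Combine per-step confidence and correctness probes into a scalar
    potential in [0, 1]. The multiplicative form is an assumption; the
    paper only states the two signals are combined."""
    return [c * k for c, k in zip(confidence, correctness)]


def spae_advantages(potential, base_advantage,
                    gain_w=1.5, drop_w=2.0,
                    sat_threshold=0.9, sat_penalty=0.1):
    """Reshape a trajectory-level advantage into per-step advantages in
    the spirit of SPAE: amplify potential gains, penalize potential
    drops, and subtract a flat penalty on steps taken after the
    potential has saturated, to discourage redundant verification.
    All weights and the threshold are illustrative, not from the paper."""
    advantages = []
    saturated = False
    prev = potential[0]
    for p in potential:
        delta = p - prev
        if delta > 0:
            shaped = base_advantage + gain_w * delta   # reward progress
        else:
            shaped = base_advantage + drop_w * delta   # delta <= 0: penalize regressions
        if saturated:
            shaped -= sat_penalty                      # encourage timely termination
        if p >= sat_threshold:
            saturated = True
        advantages.append(shaped)
        prev = p
    return advantages
```

On a toy trajectory whose potential rises and then plateaus, the steps that make progress receive an advantage above the trajectory baseline, while steps after saturation are pushed below it, which is the qualitative behavior the abstract attributes to SPAE.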
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
mathematical reasoning
advantage estimation
step-level supervision
reasoning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step Potential
Advantage Estimation
Process Supervision
Efficient Reasoning
Reinforcement Learning