Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

📅 2026-01-12
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Standard GRPO employs coarse-grained credit assignment in mathematical reasoning by uniformly distributing the overall reward across all tokens, thereby ignoring the varying contributions of individual reasoning steps to the final answer. This work proposes Outcome-grounded Advantage Reshaping (OAR), a mechanism that enables finer-grained credit assignment by evaluating each token’s influence on the outcome. OAR integrates two complementary strategies: OAR-P, based on counterfactual token perturbation, and OAR-G, leveraging input gradient sensitivity. Within a critic-free architecture, OAR further incorporates a two-level conservative advantage reshaping scheme that amplifies high-impact tokens while suppressing low-impact ones, all while preserving the total advantage magnitude. Experiments demonstrate that OAR significantly outperforms strong GRPO baselines across multiple mathematical reasoning benchmarks, with OAR-G achieving performance gains comparable to the high-fidelity OAR-P at nearly zero computational overhead.
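The counterfactual perturbation idea behind OAR-P can be sketched as follows. This is an illustrative toy, not the paper's exact procedure: the scoring function `answer_logprob` stands in for the model's log-probability of the final answer, and masking with a fixed value is one assumed perturbation scheme among many.

```python
import numpy as np

def answer_logprob(tokens, target_weights):
    """Toy stand-in for the model's log-probability of the final answer.
    A weighted sum, so tokens with larger weights matter more to the outcome."""
    return float(np.dot(tokens, target_weights))

def perturbation_importance(tokens, target_weights, mask_value=0.0):
    """Importance of token i = |score(full) - score(sequence with token i masked)|,
    i.e. how much the outcome changes under a counterfactual perturbation."""
    base = answer_logprob(tokens, target_weights)
    scores = []
    for i in range(len(tokens)):
        perturbed = tokens.copy()
        perturbed[i] = mask_value  # counterfactual: replace token i
        scores.append(abs(base - answer_logprob(perturbed, target_weights)))
    return np.array(scores)

tokens = np.array([1.0, 1.0, 1.0, 1.0])
weights = np.array([0.1, 0.9, 0.05, 0.4])  # token 1 is the 'pivotal' step
imp = perturbation_importance(tokens, weights)
print(imp)  # the pivotal token receives the largest importance
```

Note the cost profile this illustrates: one extra forward evaluation per token, which is why the paper positions gradient-based OAR-G (a single backward pass) as the cheap approximation to this high-fidelity signal.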

πŸ“ Abstract
Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contributions of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative bi-level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, and both significantly outperform a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.
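The mass-preserving reshaping step can be sketched as below. The interpolation coefficient `alpha` and the mixing rule are assumptions for illustration; the paper's actual bi-level scheme is not specified here. The key invariant shown is the one the abstract states: high-importance tokens are boosted and low-importance ones suppressed while the total advantage mass is unchanged.

```python
import numpy as np

def reshape_advantages(adv, importance, alpha=0.5):
    """Redistribute a uniform per-token advantage using normalized token
    importance, while preserving the total advantage mass.
    alpha in [0, 1] interpolates between uniform GRPO credit (alpha=0)
    and fully importance-weighted credit (alpha=1); this conservative
    mixing rule is a hypothetical stand-in for the paper's scheme."""
    n = len(importance)
    w = importance / importance.sum()        # normalized importance weights
    uniform = np.full(n, 1.0 / n)            # plain GRPO: equal credit
    mix = (1 - alpha) * uniform + alpha * w  # conservative interpolation
    total = adv * n                          # original total advantage mass
    return total * mix

adv = 0.8  # group-relative advantage, uniform across tokens in plain GRPO
importance = np.array([0.1, 0.9, 0.05, 0.4])
reshaped = reshape_advantages(adv, importance)
print(reshaped.sum())  # equals adv * n: the advantage mass is preserved
```

Because the mix weights sum to one, the reshaped advantages always sum to the original `adv * n`, so the update's overall scale matches standard GRPO while credit shifts toward pivotal tokens.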
Problem

Research questions and friction points this paper is trying to address.

credit assignment
mathematical reasoning
reinforcement learning
token-level contribution
outcome sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Outcome-grounded Advantage Reshaping
Fine-Grained Credit Assignment
Group Relative Policy Optimization
Counterfactual Perturbation
Input-Gradient Sensitivity