Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of reinforcement learning methods such as GRPO: they assign credit at the level of the whole rollout and so cannot pinpoint which reasoning segments went wrong. The authors propose a fine-grained credit assignment approach that needs no additional annotation. Within each GRPO group, it computes the Wasserstein distance between span-level hidden-state distributions of correct and incorrect rollouts, treats large distances as markers of where reasoning diverges, and reweights token-level advantages accordingly. The paper argues that this distributional discrepancy is a usable self-supervised signal and supports the claim with a separation theorem that holds under mild structural assumptions. On five mathematical reasoning and five code generation benchmarks, the method (SHEAR) improves over standard GRPO and performs strongly relative to supervised process reward models, without auxiliary models or external supervision.

📝 Abstract

Group Relative Policy Optimization (GRPO) performs coarse-grained credit assignment in reinforcement learning with verifiable rewards (RLVR) by assigning the same advantage to all tokens in a rollout. Process reward models can provide finer-grained supervision, but they require step-level annotation or additional reward modeling. We show that hidden-state distributions contain a useful signal for local reasoning quality that can be extracted using only outcome-level correctness labels available in RLVR. Specifically, within each GRPO group, the Wasserstein distance between span-level hidden state distributions of correct and incorrect rollouts increases around regions where their local reasoning quality diverges. This association holds both across examples and within individual trajectories, suggesting that hidden-state distributional divergence can serve as a self-supervision signal for fine-grained credit assignment. We formalize this observation with a separation theorem showing that, under mild structural assumptions, post-divergence spans have larger Wasserstein distances than pre-divergence spans whenever the population-level distributional gap exceeds finite-sample noise. Motivated by this result, we propose Span-level Hidden state Enabled Advantage Reweighting (SHEAR), which modifies GRPO by using span-level Wasserstein distances to scale token-level advantages, amplifying updates on tokens whose hidden states are more separated from the opposing group. The method requires no additional model and only minimal changes to the training pipeline. Experiments on five mathematical reasoning benchmarks and five code generation benchmarks show improvements over standard GRPO and strong performance relative to supervised process reward models, while requiring no additional annotation or reward model training.
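
The reweighting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how spans are formed, which Wasserstein estimator is used, or how distances are normalized. The code below assumes fixed-length spans, equal-length rollouts, a sliced (random-projection) approximation of the 1-Wasserstein distance, and per-rollout normalization of weights to mean 1. The names `sliced_w1` and `shear_like_advantages` are hypothetical.

```python
import numpy as np


def sliced_w1(x, y, n_proj=64, n_q=32, seed=0):
    """Approximate W1 between point clouds x (n, d) and y (m, d).

    Assumption: a sliced estimator (average of 1-D W1 over random unit
    directions, each computed from matched quantiles). The paper may use
    a different estimator.
    """
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(x.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    qs = (np.arange(n_q) + 0.5) / n_q
    xq = np.quantile(x @ dirs, qs, axis=0)  # (n_q, n_proj)
    yq = np.quantile(y @ dirs, qs, axis=0)
    return float(np.mean(np.abs(xq - yq)))


def shear_like_advantages(hidden, rewards, span_len=4, eps=1e-6):
    """Scale GRPO advantages by span-level distance to the opposing group.

    hidden:  (G, T, d) hidden states for G rollouts of equal length T
             (equal length is a simplifying assumption).
    rewards: (G,) outcome rewards, 1 for correct and 0 for incorrect.
    Returns a (G, T) array of token-level advantages.
    """
    hidden = np.asarray(hidden, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    G, T, d = hidden.shape

    # Standard GRPO: one group-normalized advantage per rollout.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    correct = rewards > 0.5
    weights = np.ones((G, T))
    if correct.all() or (~correct).all():
        # No opposing group exists, so fall back to uniform weights.
        return adv[:, None] * weights

    for i in range(G):
        opposing = hidden[correct != correct[i]]  # (G_opp, T, d)
        for s in range(0, T, span_len):
            own = hidden[i, s:s + span_len]
            other = opposing[:, s:s + span_len].reshape(-1, d)
            weights[i, s:s + span_len] = sliced_w1(own, other)
        # Assumption: normalize so each rollout's mean weight is about 1,
        # which keeps the overall update scale close to plain GRPO.
        weights[i] /= weights[i].mean() + eps

    return adv[:, None] * weights
```

With synthetic hidden states in which incorrect rollouts drift away from correct ones halfway through, the later tokens receive larger-magnitude advantages than the earlier ones, which is the qualitative behavior the abstract describes.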

Problem

Research questions and friction points this paper is trying to address.

credit assignment

reinforcement learning with verifiable rewards

fine-grained supervision

hidden-state distributions

reasoning quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

credit assignment

Wasserstein distance

hidden states