🤖 AI Summary
This work addresses two problems in large reasoning models: the disconnect between token-level behavior and internal reasoning mechanisms, and the instability of reinforcement learning for reasoning optimization when it relies on costly external verifiers. The authors identify and formally define a phenomenon they call “entropy-gradient inversion”, a robust negative correlation between token entropy and logit gradients, which they interpret as a geometric fingerprint of a model’s reasoning capability. Building on this insight, they propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this intrinsic signal into RL reward regularization. Experiments across multiple model scales and reasoning benchmarks show that CorR-PO consistently outperforms state-of-the-art baselines and that stronger inversion correlates with better reasoning performance.
📝 Abstract
The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive ``fast thinking'' text generation to systematic, step-by-step ``slow thinking'' reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces two obstacles: \textit{a fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization that relies on costly external verifiers}. We identify and formally define \textbf{Entropy-Gradient Inversion}, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose \textbf{Correlation-Regularized Group Policy Optimization (CorR-PO)}, which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show that CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.
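To make the central quantity concrete, the sketch below shows one way a correlation between token entropy and logit gradients could be measured. The abstract does not give the exact definition, so every choice here is an assumption: the "logit gradient" is taken to be the L2 norm of the gradient of the negative log-likelihood with respect to the logits (which equals `p - onehot(target)` for a softmax output), the correlation is Pearson's, and all function names are hypothetical. It is not the authors' implementation, and the sign of the result depends on the model and on how the gradient is defined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def logit_grad_norm(probs, target):
    """L2 norm of d(-log p[target]) / d(logits), which is p - onehot(target).

    Assumption: this stands in for the paper's "logit gradient".
    """
    return math.sqrt(
        sum((p - (1.0 if i == target else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def entropy_gradient_correlation(logits_per_token, targets):
    """Correlate per-token entropy with per-token logit-gradient norm.

    Under the assumptions above, a strongly negative value would be the
    "inversion" pattern described in the abstract.
    """
    entropies, grad_norms = [], []
    for logits, target in zip(logits_per_token, targets):
        probs = softmax(logits)
        entropies.append(token_entropy(probs))
        grad_norms.append(logit_grad_norm(probs, target))
    return pearson(entropies, grad_norms)


if __name__ == "__main__":
    # Toy logits for three tokens over a three-word vocabulary (made-up values).
    toy_logits = [[2.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0]]
    toy_targets = [0, 0, 0]
    print(entropy_gradient_correlation(toy_logits, toy_targets))
```

In a real setting the logits and sampled tokens would come from a model's forward pass over a reasoning trace. A regularizer in the spirit of CorR-PO would then presumably reward more negative correlation values, though the exact form of that term is not specified in the abstract.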