🤖 AI Summary
This work investigates whether reinforcement learning (RL) genuinely imparts new reasoning capabilities to large language models or merely redistributes probability mass over solutions the base model already contains. The study finds that RL induces sparse corrections concentrated at high-entropy decision points, affecting only 1-3% of token positions. Building on this insight, the authors propose ReasonMaxxer, an RL-free method that uses the base model’s entropy signal to identify critical positions and applies a contrastive loss at those positions through a low-dimensional parameter update. Requiring no online generation and relying solely on offline base-model rollouts, ReasonMaxxer matches or exceeds full RL across three model families, six scales, and six mathematical reasoning benchmarks, using only a few hundred rollouts and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
📝 Abstract
Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
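To make the idea of entropy-gated decision points concrete, the sketch below computes per-token entropy from base-model logits and flags the highest-entropy positions. It is an illustration only, not the authors' implementation: the function names (`token_entropy`, `entropy_gate`, `top_k_alternatives`), the fixed 2% gating fraction, and the choice of a top-fraction rule in place of an entropy threshold are all assumptions, and the contrastive loss itself is not shown because the abstract does not specify its form.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position.

    logits: array of shape (seq_len, vocab_size) from the base model.
    """
    # Subtract the row maximum for a numerically stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)


def entropy_gate(logits, fraction=0.02):
    """Boolean mask over positions, True for the highest-entropy tokens.

    `fraction` is a hypothetical knob, set here to sit inside the 1-3%
    range of affected positions reported in the abstract.
    """
    h = token_entropy(logits)
    k = max(1, int(round(fraction * h.shape[0])))
    mask = np.zeros(h.shape[0], dtype=bool)
    mask[np.argsort(h)[-k:]] = True  # indices of the k largest entropies
    return mask


def top_k_alternatives(logits, k=5):
    """Token ids of the k most likely candidates at each position, best first."""
    return np.argsort(logits, axis=-1)[:, ::-1][:, :k]
```

Under these assumptions, a training step would evaluate its loss only where `entropy_gate(logits)` is True and would consider only the candidates returned by `top_k_alternatives(logits)`, mirroring the observation that the promoted token lies within the base model's top-5 alternatives.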