🤖 AI Summary
This work investigates the internal mechanisms by which Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning capabilities of large language models (LLMs). We propose a token-entropy-based analytical framework and identify high-entropy "forking tokens" as the primary drivers of reasoning-path selection, and thus key to RLVR's efficacy. Methodologically, we integrate token-level entropy estimation, chain-of-thought analysis, and sparse policy-gradient updates. Crucially, we break from conventional full-gradient training: updating only the ~20% highest-entropy tokens matches full-gradient updates on Qwen3-8B and surpasses them on larger models, with gains scaling favorably with model size. On Qwen3-32B and Qwen3-14B, our approach improves AIME'25 scores by +11.04 and +4.79 points, respectively; in contrast, training only on low-entropy tokens degrades performance markedly. This study provides empirical evidence for an entropy-driven sparse-optimization view of RLVR, pointing toward more efficient and interpretable reasoning enhancement for LLMs.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), but its underlying mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and that these tokens act as critical forks steering the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) for RLVR. We ultimately improve RLVR by restricting policy-gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: updating only 20% of the tokens maintains performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and to optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
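The two mechanical ingredients described above, per-token entropy of the next-token distribution and a mask restricting policy-gradient updates to the top-20% highest-entropy ("forking") tokens, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the NumPy setting, and the toy logits are assumptions, and in real RLVR training the mask would gate the per-token policy-gradient loss of the actual model.

```python
import numpy as np

def token_entropy(logits):
    """Entropy H_t = -sum_v p_t(v) log p_t(v) of each token's
    next-token distribution, computed from raw logits of shape
    (seq_len, vocab_size)."""
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def forking_token_mask(logits, top_rho=0.2):
    """Boolean mask selecting the top-rho fraction of highest-entropy
    (forking) tokens; gradients would be applied only where True."""
    h = token_entropy(logits)
    k = max(1, int(round(top_rho * len(h))))
    threshold = np.sort(h)[-k]  # entropy of the k-th highest token
    return h >= threshold

# Toy example: 10 response tokens over a 5-word vocabulary, with
# per-position temperature variation so entropies differ.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 5)) * rng.uniform(0.5, 5.0, size=(10, 1))
mask = forking_token_mask(logits, top_rho=0.2)
print(int(mask.sum()))  # 2 of the 10 tokens would receive updates
```

In a training loop, this mask would simply zero out the per-token policy-gradient loss for the 80% low-entropy tokens, so the optimizer only moves probability mass at the decision points that steer the reasoning path.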