🤖 AI Summary
This work investigates the internal mechanisms by which Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning capabilities of large language models (LLMs). We propose a token-entropy-based analytical framework and identify high-entropy "forking tokens" as the primary drivers of reasoning-path selection, and thus key to RLVR's efficacy. Methodologically, we integrate token-level entropy estimation, chain-of-thought analysis, and sparse policy-gradient updates. Crucially, we break from conventional full-gradient training: updating only the ~20% highest-entropy tokens matches full-gradient updates on Qwen3-8B and surpasses them on larger models, with gains scaling favorably with model size. On Qwen3-32B and Qwen3-14B, our approach improves AIME'25 scores by +11.04 and +4.79 points, respectively; in contrast, training only on low-entropy tokens degrades performance markedly. This study provides empirical evidence for an entropy-driven sparse-optimization view of RLVR, pointing toward more efficient and interpretable reasoning enhancement for LLMs.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), but its underlying mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and that these tokens act as critical forks steering the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) for RLVR. We ultimately improve RLVR by restricting policy-gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: updating only 20% of the tokens maintains performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and to optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
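The two mechanical ingredients described above, per-token entropy of the next-token distribution and a mask restricting policy-gradient updates to the top-20% highest-entropy ("forking") tokens, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the NumPy setting, and the toy logits are assumptions, and in real RLVR training the mask would gate the per-token policy-gradient loss of the actual model.

```python
import numpy as np

def token_entropy(logits):
    """Entropy H_t = -sum_v p_t(v) log p_t(v) of each token's
    next-token distribution, computed from raw logits of shape
    (seq_len, vocab_size)."""
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def forking_token_mask(logits, top_rho=0.2):
    """Boolean mask selecting the top-rho fraction of highest-entropy
    (forking) tokens; gradients would be applied only where True."""
    h = token_entropy(logits)
    k = max(1, int(round(top_rho * len(h))))
    threshold = np.sort(h)[-k]  # entropy of the k-th highest token
    return h >= threshold

# Toy example: 10 response tokens over a 5-word vocabulary, with
# per-position temperature variation so entropies differ.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 5)) * rng.uniform(0.5, 5.0, size=(10, 1))
mask = forking_token_mask(logits, top_rho=0.2)
print(int(mask.sum()))  # 2 of the 10 tokens would receive updates
```

In a training loop, this mask would simply zero out the per-token policy-gradient loss for the 80% low-entropy tokens, so the optimizer only moves probability mass at the decision points that steer the reasoning path.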