Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

📅 2025-09-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a hierarchical reasoning mechanism that emerges when reinforcement learning (RL) enhances large language models' (LLMs) complex reasoning capabilities: low-level token generation is decoupled from high-level strategic planning. We find that standard RL algorithms—e.g., GRPO—dilute critical credit signals by uniformly distributing gradients across all tokens, severely hindering efficient exploration of high-level planning policies. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), a hierarchy-aware algorithm that dynamically identifies high-impact planning tokens via semantic entropy and concentrates gradient updates on these strategic decision points. Experiments across diverse complex reasoning benchmarks demonstrate that HICRA significantly improves both sample efficiency and out-of-distribution generalization, surpassing strong baselines. Our approach provides a novel, interpretable paradigm for understanding and enhancing structured reasoning in LLMs through principled credit assignment.

📝 Abstract
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling", and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. HICRA significantly outperforms strong baselines, demonstrating that focusing on this strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we validate semantic entropy as a superior compass for measuring strategic exploration over misleading metrics such as token-level entropy.
Problem

Research questions and friction points this paper is trying to address.

Understanding emergent hierarchical reasoning mechanisms in LLMs
Addressing inefficiency in RL algorithms for strategic planning
Developing better metrics for measuring strategic exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

HICRA algorithm for credit assignment
Focuses on high-impact planning tokens
Uses semantic entropy for exploration
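The core idea above—identify high-uncertainty planning tokens and concentrate credit on them—can be illustrated with a minimal sketch. Note the assumptions: the paper uses semantic entropy to flag planning tokens, while this toy uses plain Shannon entropy over next-token distributions as a stand-in for that criterion, and the `threshold` and `boost` hyperparameters are illustrative, not values from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hicra_weights(token_probs, threshold=1.0, boost=2.0):
    """Hedged sketch of hierarchy-aware credit assignment:
    tokens whose predictive entropy exceeds `threshold` are treated as
    high-impact planning tokens and receive `boost`x credit; other
    (procedural) tokens keep weight 1.0. In a real RL update these
    weights would scale each token's advantage/gradient term, unlike
    GRPO, which weights all tokens uniformly."""
    return [boost if token_entropy(p) > threshold else 1.0
            for p in token_probs]

# Toy example: a near-deterministic procedural token vs. an uncertain
# "planning" token (uniform over 4 options, entropy ln 4 ≈ 1.386).
procedural = [0.97, 0.01, 0.01, 0.01]
planning = [0.25, 0.25, 0.25, 0.25]
print(hicra_weights([procedural, planning]))  # -> [1.0, 2.0]
```

The sketch only shows where the extra optimization pressure lands; the paper's contribution is identifying *which* tokens deserve it and showing that this focus improves sample efficiency.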