🤖 AI Summary
This work identifies a hierarchical reasoning mechanism that emerges when reinforcement learning (RL) is used to enhance the complex reasoning capabilities of large language models (LLMs): low-level token generation becomes decoupled from high-level strategic planning. We find that standard RL algorithms—e.g., GRPO—dilute critical credit signals by uniformly distributing gradients across all tokens, severely hindering efficient exploration of high-level planning policies. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that dynamically identifies high-impact planning tokens via semantic entropy and concentrates gradient updates on these strategic decision points. Experiments across diverse complex reasoning benchmarks demonstrate that HICRA significantly improves both sample efficiency and out-of-distribution generalization, surpassing strong baselines. Our approach provides a novel, interpretable paradigm for understanding and enhancing structured reasoning in LLMs through principled credit assignment.
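The core idea of concentrating gradient updates on planning tokens can be illustrated with a minimal sketch. Everything below is illustrative rather than the paper's actual implementation: the function name, the use of per-token entropy as a stand-in proxy for identifying planning tokens, and the `top_frac` / `alpha` parameters are all assumptions.

```python
def hicra_token_weights(token_entropies, advantage, top_frac=0.2, alpha=2.0):
    """Hypothetical sketch of hierarchy-aware credit assignment.

    Treat the highest-entropy tokens as 'planning' tokens (a stand-in
    for the paper's semantic-entropy criterion) and amplify the
    advantage applied to them, so gradient updates concentrate on
    strategic decision points instead of being spread uniformly.
    """
    n = len(token_entropies)
    k = max(1, int(top_frac * n))  # how many tokens to treat as planning tokens
    # Indices of the k highest-entropy tokens.
    planning = set(sorted(range(n), key=lambda i: -token_entropies[i])[:k])
    # Planning tokens get an amplified advantage; others keep the base value.
    return [advantage * (alpha if i in planning else 1.0) for i in range(n)]
```

In a GRPO-style update, these per-token weights would replace the uniform advantage broadcast across the sequence, so the policy gradient is dominated by the identified planning positions.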
📝 Abstract
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet the underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like ``aha moments'', ``length-scaling'', and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose HIerarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. HICRA significantly outperforms strong baselines, demonstrating that focusing on this strategic bottleneck is key to unlocking advanced reasoning. Furthermore, we validate semantic entropy as a superior compass for measuring strategic exploration over misleading metrics such as token-level entropy.
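The distinction between semantic entropy and token-level entropy can be made concrete with a small sketch: instead of measuring uncertainty over surface tokens, semantic entropy measures uncertainty over meaning-equivalent clusters of sampled completions. This is a hedged illustration, not the paper's method; in particular, the trivial string-normalization used for semantic equivalence here is a placeholder for a real equivalence test.

```python
import math
from collections import Counter

def semantic_entropy(samples, equiv=lambda s: s.strip().lower()):
    """Sketch of semantic entropy over sampled completions.

    Group completions into clusters via an equivalence function
    (here, naive string normalization stands in for genuine semantic
    equivalence), then compute Shannon entropy over the empirical
    cluster distribution. Two differently phrased but equivalent
    answers contribute to the same cluster, so paraphrase diversity
    alone does not inflate the entropy.
    """
    counts = Counter(equiv(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Under this view, high token-level entropy with low semantic entropy indicates mere rephrasing, while high semantic entropy signals genuine exploration of distinct strategies.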