🤖 AI Summary
The exploration mechanisms of large language models (LLMs) in reinforcement learning with verifiable rewards (RLVR) remain poorly understood and lack systematic, quantitative analysis.
Method: This paper proposes a quantifiable framework for analyzing LLM exploration behavior: (1) an entropy dynamics model operating at both training-stage and token-level granularity to characterize exploration breadth and depth; (2) a boundary metric for the exploration space that exposes the entropy–performance trade-off; and (3) a transformation paradigm linking "exploration gain → reasoning optimization → performance improvement." The approach integrates RL with rule-based reward feedback, instance- and token-level entropy estimation, and explicit exploration-space modeling, and is validated empirically.
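The instance- and token-level entropy estimation mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it simply shows the standard way such quantities are computed from a model's per-token logits (all function names and the logits layout are illustrative assumptions):

```python
import math

def token_entropies(logits):
    """Per-token Shannon entropy (in nats) from raw logits.

    logits: list of per-token logit vectors, shape [seq_len][vocab_size].
    Returns one entropy value per generated token. Averaging these gives
    an instance-level estimate; averaging over a batch of instances gives
    a training-stage estimate.
    """
    entropies = []
    for step_logits in logits:
        m = max(step_logits)                          # subtract max for numerical stability
        exps = [math.exp(x - m) for x in step_logits]
        z = sum(exps)
        probs = [e / z for e in exps]                 # softmax over the vocabulary
        h = -sum(p * math.log(p) for p in probs if p > 0)
        entropies.append(h)
    return entropies

def instance_entropy(logits):
    """Instance-level entropy: mean of the token-level entropies."""
    hs = token_entropies(logits)
    return sum(hs) / len(hs)
```

A uniform distribution over a vocabulary of size V yields the maximum entropy log V, while a sharply peaked distribution yields entropy near zero; tracking how this quantity evolves over training is what the report's entropy dynamics model formalizes.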
Contribution/Results: This work provides the first systematic characterization of the trial-and-error-driven refinement mechanism that LLMs exhibit in RLVR. It establishes theoretical foundations and delivers reusable tools for the controllable generation and optimization of complex reasoning chains.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering three main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.