🤖 AI Summary
This work addresses unstable policy optimization in Reinforcement Learning with Verifiable Rewards (RLVR), where large language models struggle to balance exploration and exploitation on extremely hard or extremely easy samples. The authors propose a fine-grained exploration-exploitation mechanism based on disentangling the perplexity space: samples are partitioned into a high-perplexity (exploration) subspace and a low-perplexity (exploitation) subspace, and a bidirectional reward allocation strategy then guides training while perturbing the verifiable rewards as little as possible. Experiments on mathematical reasoning and function calling show that the approach improves model performance, which supports the claim that it yields a finer-grained and more stable trade-off between exploration and exploitation.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby identifying, at a fine granularity, the samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that has minimal impact on the verifiable rewards and implements perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling. The experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.
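
To make the two steps in the abstract concrete, the following is a minimal sketch and not the authors' implementation. Perplexity is computed in the standard way, as the exponential of the negative mean token log-probability. Everything else is an assumption made for illustration: the function names, the use of the group median as the split threshold, the `eps` parameter, and the specific shaping rule. The abstract does not state the actual allocation rule; the placeholder below only illustrates what "bidirectional" and "minimal impact" could mean for binary verifiable rewards.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(ppls):
    """Label each sample 'explore' (perplexity above the group median) or 'exploit'.

    Assumption: the median threshold is a placeholder; the abstract does not
    say how the perplexity space is divided.
    """
    threshold = median(ppls)
    return ["explore" if p > threshold else "exploit" for p in ppls]


def shaped_rewards(rewards, ppls, eps=0.05):
    """Hypothetical bidirectional shaping that moves each reward by at most eps.

    Placeholder rule (an assumption, not taken from the paper):
      - explore subspace, verified correct (reward > 0): add eps
      - exploit subspace, verified incorrect (reward == 0): subtract eps
      - all other samples: unchanged
    With binary rewards and eps < 0.5, a correct sample always keeps a higher
    reward than an incorrect one, so the verifier's ordering is preserved.
    """
    labels = split_by_perplexity(ppls)
    shaped = []
    for reward, label in zip(rewards, labels):
        if label == "explore" and reward > 0:
            shaped.append(reward + eps)
        elif label == "exploit" and reward == 0:
            shaped.append(reward - eps)
        else:
            shaped.append(reward)
    return shaped
```

For example, `shaped_rewards([1, 0, 1, 0], [1.0, 2.0, 3.0, 4.0])` leaves the low-perplexity correct sample and the high-perplexity incorrect sample unchanged, lowers the low-perplexity incorrect sample slightly, and raises the high-perplexity correct sample slightly. Keeping `eps` small relative to the gap between verifiable reward values is what keeps the perturbation minimal in this sketch.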