🤖 AI Summary
This work addresses the trade-off between exploration diversity and task accuracy when training large language models (LLMs) with reinforcement learning (RL). The authors propose Policy Split, a paradigm that maintains two modes within a single shared-parameter policy: a normal mode optimized for task correctness and a high-entropy mode, selected by a high-entropy prompt, that favors exploration. The two modes are trained jointly under dual-mode entropy regularization, so that each pursues its own objective and produces distinct behavior. In experiments, the approach consistently outperforms entropy-guided RL baselines across model sizes on both general and creative tasks, and the authors report gains in both exploration efficiency and overall task performance.
📝 Abstract
To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that uses a high-entropy prompt to bifurcate the policy into a normal mode and a high-entropy mode. The two modes share model parameters but undergo collaborative dual-mode entropy regularization tailored to distinct objectives: the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes on general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, in which the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
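
The abstract does not give the exact form of the regularizer, so the following is only a minimal sketch of how a prompt-selected, per-mode entropy bonus could look. Everything here is an assumption made for illustration: the prompt wording, the function names, the coefficient values, and the choice of adding mean token entropy to the reward are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical prompt text; the paper's actual high-entropy prompt is not given here.
HIGH_ENTROPY_PROMPT = "Explore diverse and unconventional approaches."

# Illustrative per-mode entropy coefficients (assumed values, not from the paper).
ENTROPY_COEF = {"normal": 0.0, "high_entropy": 0.01}


def build_prompt(task: str, mode: str) -> str:
    """Select the mode by prepending the high-entropy prompt; the policy is shared."""
    if mode == "high_entropy":
        return f"{HIGH_ENTROPY_PROMPT}\n{task}"
    return task


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(token_logits: list[list[float]]) -> float:
    """Average entropy of the next-token distribution across a response."""
    return sum(entropy(softmax(row)) for row in token_logits) / len(token_logits)


def regularized_objective(reward: float, token_logits: list[list[float]], mode: str) -> float:
    """Task reward plus a mode-specific entropy bonus (assumed additive form)."""
    return reward + ENTROPY_COEF[mode] * mean_token_entropy(token_logits)
```

In this sketch the normal mode is scored on task reward alone, while the high-entropy mode also receives credit for spreading probability mass across tokens. Both modes would update the same parameters, with only the prompt and the coefficient differing.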