Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization

📅 2026-04-13

🤖 AI Summary

This work addresses the trade-off between exploration diversity and task accuracy when large language models (LLMs) are trained with reinforcement learning. The authors propose Policy Split, a paradigm that introduces what they describe as the first dual-mode policy architecture: a single shared-parameter policy network maintains both a normal mode optimized for task accuracy and a high-entropy mode dedicated to exploration. The high-entropy mode is selected with a high-entropy prompt, and the two modes are co-trained through dual-mode entropy regularization, so each mode pursues its own objective and produces its own style of behavior. In the reported experiments, the approach consistently outperforms existing entropy-based reinforcement learning baselines across model scales on both general and creative tasks, improving both exploration efficiency and overall task performance.

📝 Abstract

To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates behavioral patterns distinct from those of the normal mode, providing unique learning signals.
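
The abstract does not give the exact form of the regularizer, so the following is only a minimal sketch of how a dual-mode, entropy-regularized objective could be assembled. It assumes a REINFORCE-style policy-gradient term plus an entropy bonus whose coefficient differs per mode. The prompt text, the function names, and the coefficient values (`coef_normal`, `coef_high`) are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical prefix that switches the shared policy into high-entropy mode.
HIGH_ENTROPY_PROMPT = "Explore diverse and unconventional solutions.\n"


def build_prompts(task):
    """Return the (normal-mode, high-entropy-mode) prompts for one task."""
    return task, HIGH_ENTROPY_PROMPT + task


def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(dists):
    """Average per-token entropy over the distributions of one response."""
    return sum(token_entropy(d) for d in dists) / len(dists)


def mode_loss(logprob_sampled, advantage, dists, entropy_coef):
    """Policy-gradient loss with an entropy bonus for one sampled response.

    logprob_sampled: summed log-probability of the sampled tokens
    advantage:       scalar advantage of the response
    dists:           next-token distributions, one list of probabilities per step
    entropy_coef:    weight of the entropy bonus (assumed to differ per mode)
    """
    pg_loss = -advantage * logprob_sampled
    return pg_loss - entropy_coef * mean_entropy(dists)


def dual_mode_loss(normal, high, coef_normal=0.0, coef_high=0.01):
    """Sum the losses of both modes, which share the same parameters.

    The coefficient values are illustrative assumptions: the normal mode gets
    no entropy bonus, while the high-entropy mode is rewarded for spreading
    probability mass.
    """
    return mode_loss(*normal, coef_normal) + mode_loss(*high, coef_high)
```

In an actual training loop these quantities would be tensors and the loss would be differentiated by an autograd framework; plain floats are used here only to keep the arithmetic visible. Each argument to `dual_mode_loss` is a `(logprob_sampled, advantage, dists)` tuple for a response sampled from the corresponding prompt returned by `build_prompts`.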

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

large language models

exploration

entropy regularization

policy diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Split

dual-mode exploration

entropy regularization