DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses unstable policy optimization in Reinforcement Learning with Verifiable Rewards (RLVR), where large language models struggle to balance exploration and exploitation on extremely hard or extremely easy samples. The authors propose a fine-grained exploration-exploitation mechanism based on disentangling the perplexity space: samples are partitioned into a high-perplexity (exploration) subspace and a low-perplexity (exploitation) subspace, and a bidirectional reward allocation strategy perturbs the verifiable rewards as little as possible. Experiments on mathematical reasoning and function calling show that the approach improves model performance, supporting the claim that it achieves a finer-grained and more stable trade-off between exploration and exploitation.
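To make the partitioning idea concrete, here is a minimal sketch of computing a response's perplexity from its token log-probabilities and splitting a batch around a cutoff. The function names and the fixed `threshold` are hypothetical and chosen for illustration only; the summary does not say how the paper defines the boundary between the two subspaces.

```python
import math


def sequence_perplexity(token_logprobs):
    """Standard perplexity: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(samples, threshold):
    """Partition samples into (high_ppl, low_ppl) index lists.

    `samples` is a list of per-token log-probability lists.
    `threshold` is a hypothetical fixed cutoff (an assumption, not from the paper).
    """
    high, low = [], []
    for idx, logprobs in enumerate(samples):
        if sequence_perplexity(logprobs) > threshold:
            high.append(idx)  # candidate "exploration" subspace
        else:
            low.append(idx)   # candidate "exploitation" subspace
    return high, low
```

For example, a response whose tokens each have probability 0.5 has perplexity 2, so with `threshold=3.0` it would fall on the low-perplexity side.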

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze in depth the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining the fine-grained samples that require an exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with minimal impact on the verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling. Experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.
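The abstract does not give the reward formula, so the following is only an illustrative sketch of what a perplexity-guided, minimally invasive reward adjustment could look like, not the authors' implementation. It assumes binary verifiable rewards in {0, 1}, a hypothetical perplexity `threshold`, and a hypothetical bonus size `eps`; the `favor_high` switch is likewise an assumption standing in for whichever direction the method applies. Keeping `eps` below 0.5 guarantees that every verified-correct response still scores above every incorrect one.

```python
def shape_rewards(rewards, perplexities, threshold, eps=0.1, favor_high=True):
    """Nudge binary verifiable rewards by +/- eps based on perplexity side.

    Hypothetical rule (not from the paper): responses on the favored side of
    `threshold` gain `eps`, the others lose `eps`. With eps < 0.5 the ordering
    between correct (1) and incorrect (0) responses is preserved.
    """
    if len(rewards) != len(perplexities):
        raise ValueError("rewards and perplexities must have the same length")
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must lie in [0, 0.5) to preserve reward ordering")

    sign = 1.0 if favor_high else -1.0
    shaped = []
    for reward, ppl in zip(rewards, perplexities):
        side = 1.0 if ppl > threshold else -1.0  # high vs. low perplexity
        shaped.append(reward + sign * side * eps)
    return shaped
```

One reason such a nudge might matter: when every response to a prompt is correct (or every one is wrong), the raw rewards are identical and group-relative advantages vanish, whereas the shaped rewards still differ across the two perplexity subspaces.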
Problem

Research questions and friction points this paper is trying to address.

exploration-exploitation trade-off
reinforcement learning
large language models
perplexity
fine-grained optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled Perplexity
Exploration-Exploitation Trade-off
Policy Optimization
Reinforcement Learning with Verifiable Rewards
Large Language Models