DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses unstable policy optimization in reinforcement learning with verifiable rewards (RLVR), where large language models struggle to balance exploration and exploitation on extremely hard or extremely easy samples. The authors propose a fine-grained exploration-exploitation mechanism based on disentangling the perplexity space: samples are partitioned into a high-perplexity (exploration) subspace and a low-perplexity (exploitation) subspace, and a bidirectional reward allocation strategy guides each subspace while perturbing the verifiable rewards as little as possible. Experiments on mathematical reasoning and function-calling tasks show improved model performance, which the authors take as evidence of a finer-grained and more stable trade-off between exploration and exploitation.
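The perplexity split described above can be illustrated with a minimal sketch. It assumes perplexity is the standard exponential of the negative mean token log-probability and that a single scalar threshold separates the two subspaces. The function names and the threshold are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard sequence perplexity: exp of the negative mean token log-prob."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(
    ppls: Sequence[float], threshold: float
) -> Tuple[List[int], List[int]]:
    """Return (explore_idx, exploit_idx) for a batch of sample perplexities.

    Assumption: a sample is assigned to the exploration subspace when its
    perplexity is strictly above `threshold`, otherwise to exploitation.
    """
    explore = [i for i, p in enumerate(ppls) if p > threshold]
    exploit = [i for i, p in enumerate(ppls) if p <= threshold]
    return explore, exploit
```

In practice the threshold might be a batch statistic such as the median perplexity; that choice is likewise an assumption of this sketch.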

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we analyze the exploration-exploitation dilemma posed by extremely hard and extremely easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby identifying, at a fine granularity, the samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that has minimal impact on the verification rewards and implements perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling. Experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.
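The abstract does not give the formula for the bidirectional reward allocation, so the following is only a sketch of one way such a mechanism could look. It assumes binary verifiable rewards (0 or 1), a group of sampled responses per prompt, and a small bounded bonus whose sign depends on the direction: "explore" favors higher-perplexity responses and "exploit" favors lower-perplexity ones. The function `shaped_rewards`, the `mode` argument, and the bound `eps` are hypothetical names introduced for illustration.

```python
from typing import List, Sequence


def shaped_rewards(
    rewards: Sequence[float],
    ppls: Sequence[float],
    mode: str,
    eps: float = 0.1,
) -> List[float]:
    """Add a bounded, perplexity-dependent bonus to verifiable rewards.

    Assumption: the bonus lies in [-eps, eps] with eps < 0.5, so a correct
    response (reward 1) always stays above an incorrect one (reward 0).
    """
    if len(rewards) != len(ppls):
        raise ValueError("rewards and ppls must have the same length")
    if mode not in ("explore", "exploit"):
        raise ValueError("mode must be 'explore' or 'exploit'")
    if not 0.0 <= eps < 0.5:
        raise ValueError("eps must be in [0, 0.5)")

    lo, hi = min(ppls), max(ppls)
    span = hi - lo
    sign = 1.0 if mode == "explore" else -1.0

    shaped = []
    for r, p in zip(rewards, ppls):
        # Rescale perplexity within the group to [-1, 1]; 0 if all are equal.
        z = 0.0 if span == 0 else 2.0 * (p - lo) / span - 1.0
        shaped.append(r + sign * eps * z)
    return shaped
```

One motivation for a design like this: when every response in a group receives the same verifiable reward, as can happen on extremely hard or extremely easy prompts, a group-normalized advantage carries no signal, and a small perplexity-based term restores a gradient without reordering correct and incorrect responses.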

Problem

Research questions and friction points this paper is trying to address.

exploration-exploitation trade-off

reinforcement learning

large language models

perplexity

fine-grained optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled Perplexity

Exploration-Exploitation Trade-off

Policy Optimization