Targeted Exploration via Unified Entropy Control for Reinforcement Learning

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses entropy collapse in reinforcement learning, which causes policies to converge prematurely and stop exploring. It proposes Unified Entropy Control (UEC-RL), a framework that combines targeted exploration with entropy stabilization: exploration is increased on challenging prompts, while an entropy stabilizer keeps entropy from growing uncontrollably and destabilizing training. The approach expands the effective search space while keeping optimization stable, avoiding the additional bias or variance that existing exploration methods introduce. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, and it shows consistent gains over RL baselines on both Pass@1 and Pass@$k$ across reasoning tasks for large language models and vision-language models.
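For readers unfamiliar with the Pass@1 and Pass@$k$ metrics mentioned above, the following is a minimal sketch of how they are commonly estimated. It assumes the widely used unbiased estimator (draw n samples per problem, count the c correct ones); this page does not state which estimator the paper uses, so treat this as an assumption. The helper `relative_improvement` and both function names are illustrative, not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 minus the probability that a random size-k subset has no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_improvement(baseline: float, score: float) -> float:
    """Relative gain of score over baseline, as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (score - baseline) / baseline
```

With k = 1 the estimate reduces to c / n, the fraction of sampled answers that are correct. A relative improvement of 37.9% corresponds to `relative_improvement(...)` returning 0.379; the underlying scores are not given on this page.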

📝 Abstract
Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL increases exploration on difficult prompts to search for potentially valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@$k$. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring its value for scaling RL-based reasoning in large models. Our code is available at https://github.com/597358816/UEC-RL.
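To make the ideas in the abstract concrete, here is a rough sketch of three ingredients it mentions: GRPO-style group-relative advantages, the entropy of a next-token distribution, and a gate that raises an entropy bonus on hard prompts while penalizing entropy once it exceeds a cap. This is not the authors' implementation. The abstract does not specify how difficulty is measured or how the stabilizer works, so `entropy_coefficient`, its thresholds, and its default values are hypothetical placeholders chosen only for illustration; the use of population standard deviation is likewise an assumption.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group of responses sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy in nats of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_coefficient(rewards, current_entropy, *, hard_threshold=0.25,
                        bonus=0.01, entropy_cap=2.0, penalty=0.01):
    """Hypothetical gate for the sign and size of an entropy term in the loss.

    All thresholds and magnitudes are illustrative, not values from the paper.
    """
    if current_entropy > entropy_cap:
        # Stabilizer: entropy is too high, so push it back down.
        return -penalty
    if mean(rewards) <= hard_threshold:
        # Hard prompt (few correct responses in the group): encourage exploration.
        return bonus
    # Easy prompt with entropy in range: leave entropy alone.
    return 0.0
```

In this sketch, a group with rewards `[0, 0, 0, 1]` counts as a hard prompt and receives the positive bonus, whereas the same group with entropy above the cap receives the negative penalty instead. Entropy collapse corresponds to `token_entropy` drifting toward zero as the policy concentrates probability on a single token.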
Problem

Research questions and friction points this paper is trying to address.

entropy collapse
premature convergence
exploration-exploitation trade-off
optimization stability
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Entropy Control
Targeted Exploration
Entropy Collapse
Reinforcement Learning
Reasoning Trajectories
Authors

Chen Wang
College of Software, Nankai University; Zhongguancun Academy
Lai Wei
Zhongguancun Academy; Shanghai Jiao Tong University
Yanzhi Zhang
Zhongguancun Academy; Chinese Academy of Sciences
Chenyang Shao
PhD student, EE, Tsinghua University
Large Language Model, LLM Agent, RL
Zedong Dan
Zhongguancun Academy; Sun Yat-sen University
Weiran Huang
Zhongguancun Academy; Chinese Academy of Sciences
Ge Lan
College of Software, Nankai University
Yue Wang
Zhongguancun Academy