Entropy-Preserving Reinforcement Learning

📅 2026-03-12
🤖 AI Summary
This work addresses the inherent tendency of policy gradient algorithms to reduce policy entropy during training, which undermines exploration diversity and continual learning capabilities. The authors systematically analyze how policy gradient objectives influence entropy dynamics and propose an explicit control mechanism to mitigate this issue. They introduce the REPO family of algorithms together with ADAPO, an adaptive asymmetric clipping method, to effectively regulate policy entropy throughout training. By integrating advantage function correction, adaptive clipping, and entropy dynamics modeling, the proposed approach maintains high policy diversity while enhancing trainability. As a result, the learned policies achieve superior performance and demonstrate stronger continual learning abilities in novel environments.
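The summary describes REPO as modifying the advantage function to regulate entropy. The paper's exact formulation is not reproduced on this page, so the following is a minimal sketch of the general idea under stated assumptions: when measured policy entropy drops below a target, each token's advantage is shifted by a bonus proportional to its negative log-probability, so rare tokens are favored and the update pushes entropy back up. The names `policy_entropy` and `entropy_corrected_advantages`, the coefficient `beta`, and the linear form of the correction are all illustrative, not the authors' definitions.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_corrected_advantages(advantages, logprobs, target_entropy,
                                 current_entropy, beta=0.1):
    """Shift advantages by an entropy-gap term (illustrative sketch).

    When current entropy falls below the target, rare tokens (large -logprob)
    receive a larger bonus than tokens the policy already favors, nudging the
    gradient update toward higher entropy. When entropy meets the target, the
    advantages pass through unchanged.
    """
    gap = max(0.0, target_entropy - current_entropy)  # act only when entropy is too low
    return [a + beta * gap * (-lp) for a, lp in zip(advantages, logprobs)]
```

With a low-entropy policy, the correction ranks a rare token's advantage above a common token's, even when their raw advantages are equal; with entropy at or above target, it is a no-op.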

📝 Abstract
Policy gradient algorithms have driven many recent advancements in language model reasoning. An appealing property is their ability to learn from exploration on their own trajectories, a process crucial for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce the entropy -- and thus the diversity of explored trajectories -- as part of training, yielding a policy increasingly limited in its ability to explore. In this paper, we argue that entropy should be actively monitored and controlled throughout training. We formally analyze the contributions of leading policy gradient objectives to entropy dynamics, identify empirical factors (such as numerical precision) that significantly impact entropy behavior, and propose explicit mechanisms for entropy control. These include REPO, a family of algorithms that modify the advantage function to regulate entropy, and ADAPO, an adaptive asymmetric clipping approach. Models trained with our entropy-preserving methods maintain diversity throughout training, yielding final policies that are more performant and retain their trainability for sequential learning in new environments.
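The abstract names ADAPO as an "adaptive asymmetric clipping approach" but does not spell out its schedule, so here is a hedged sketch of what asymmetric clipping can look like on top of the standard PPO surrogate: the lower and upper clip bounds differ, and the upper bound widens when entropy is below target, letting exploratory probability increases survive clipping. The function names, the linear widening rule, and the coefficient `k` are assumptions for illustration, not the paper's method.

```python
def adaptive_clip_range(base_eps, entropy_gap, k=0.5):
    """Return (eps_low, eps_high) clip bounds (hypothetical adaptation rule).

    A positive entropy_gap (target minus current entropy) widens only the
    upper bound, so probability *increases* are clipped less aggressively
    while decreases remain tightly bounded.
    """
    eps_low = base_eps
    eps_high = base_eps + k * max(0.0, entropy_gap)
    return eps_low, eps_high

def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style pessimistic clipped objective with asymmetric bounds."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With symmetric bounds of 0.2, a ratio of 1.5 on a positive-advantage token is cut to 1.2; widening the upper bound to 0.4 lets the same update contribute 1.4, while negative-advantage updates are still clipped at the usual lower bound.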
Problem

Research questions and friction points this paper is trying to address.

entropy
policy gradient
reinforcement learning
trajectory diversity
exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

entropy-preserving
policy gradient
REPO
ADAPO
trajectory diversity