Divergence-Augmented Policy Optimization

📅 2025-01-25
🏛️ Neural Information Processing Systems
📈 Citations: 13
✨ Influential: 1
🤖 AI Summary
To address the instability and premature convergence caused by reusing off-policy data in deep reinforcement learning, this paper proposes a Bregman divergence constraint defined on state distributions. Unlike conventional approaches that define the Bregman divergence over action probabilities, this method formulates it over the state distributions induced by the policies, yielding a divergence-augmented policy optimization framework. By explicitly bounding how much a policy update shifts the induced state distribution, the approach keeps the reuse of off-policy data small and safe. Evaluated on the Atari benchmark in data-scarce settings, the method improves training stability and convergence speed, and achieves better sample efficiency and policy robustness than mainstream algorithms including PPO and SAC. These results support the effectiveness and practicality of regularization at the state-distribution level.

πŸ“ Abstract
In deep reinforcement learning, policy optimization methods need to deal with issues such as function approximation and the reuse of off-policy data. Standard policy gradient methods do not handle off-policy data well, leading to premature convergence and instability. This paper introduces a method to stabilize policy optimization when off-policy data are reused. The idea is to include a Bregman divergence between the behavior policy that generates the data and the current policy to ensure small and safe policy updates with off-policy data. The Bregman divergence is calculated between the state distributions of two policies, instead of only on the action probabilities, leading to a divergence augmentation formulation. Empirical experiments on Atari games show that in the data-scarce scenario where the reuse of off-policy data becomes necessary, our method can achieve better performance than other state-of-the-art deep reinforcement learning algorithms.
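As a rough illustration only (not the paper's actual algorithm), the divergence-penalized update described above can be sketched with a KL penalty, a special case of Bregman divergence, computed on sampled action probabilities. The paper's key contribution is to place the divergence on the state distributions induced by the policies, which this toy sketch does not attempt; the function and parameter names below are hypothetical.

```python
import math

def surrogate_loss(pi_new, pi_old, advantages, eta=0.1):
    """Importance-weighted policy objective minus a divergence penalty.

    pi_new, pi_old: probabilities of the sampled actions under the
        current and behavior policies (illustrative inputs)
    advantages: advantage estimates for those actions
    eta: step-size / regularization coefficient (assumed name)
    """
    n = len(advantages)
    # Importance-sampling surrogate: E[(pi_new / pi_old) * A]
    surrogate = sum((pn / po) * a
                    for pn, po, a in zip(pi_new, pi_old, advantages)) / n
    # KL(pi_old || pi_new) estimated from the sampled actions;
    # here it stands in for the Bregman divergence term
    kl = sum(math.log(po / pn) for pn, po in zip(pi_new, pi_old)) / n
    # Penalizing the divergence keeps the update small and safe
    return surrogate - kl / eta

# When the current policy equals the behavior policy the penalty
# vanishes and the plain surrogate objective remains.
loss = surrogate_loss([0.5, 0.3], [0.4, 0.4], [1.0, -0.5])
```

Replacing the KL term with a divergence over policy-induced state distributions, as the paper does, requires estimating state visitation distributions and is substantially more involved than this sketch.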
Problem

Research questions and friction points this paper is trying to address.

Deep Reinforcement Learning
Stable Policy Improvement
Avoidance of Premature Convergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bregman Divergence
Stable Policy Improvement
Deep Reinforcement Learning
Qing Wang
Huya AI, Guangzhou, China; Tencent AI Lab, Shenzhen, China
Yingru Li
The Chinese University of Hong Kong, Shenzhen, China
Jiechao Xiong
Tencent AI Lab
Tong Zhang
The Hong Kong University of Science and Technology, Hong Kong, China