🤖 AI Summary
Existing reinforcement learning and imitation learning approaches struggle to simultaneously satisfy new task requirements and preserve the desirable properties of a prior policy; in particular, residual learning methods tailored to policy gradient frameworks have been lacking. To address this, we propose Residual Policy Gradient (RPG), the first method to integrate the residual Q-learning paradigm into the policy gradient framework. RPG explicitly models a residual action distribution during policy updates, enabling controlled refinement of prior policies. Theoretically, we reinterpret the widely used KL-regularized objective, uncovering its implicit maximum-entropy trade-off. Algorithmically, RPG unifies soft policy gradients with residual Q-value estimation to yield a differentiable, stable, and constraint-aware optimization objective. Evaluated on the MuJoCo benchmark, RPG significantly improves both stability and task adaptability in policy customization, establishing a new paradigm for gradient-based policy transfer.
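As background, the residual Q-learning setup that RPG builds on can be sketched as follows (a hedged sketch in our own notation, not necessarily the paper's: $r$ is the prior task's unknown reward, $r_R$ the add-on reward encoding the new requirements, $\omega$ a trade-off weight, and $\alpha$ an entropy temperature). Policy customization seeks a maximum-entropy policy for the combined reward:

```latex
\pi^{\star} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{\pi}\!\left[\sum_{t} \omega\, r(s_t, a_t) \;+\; r_R(s_t, a_t)
\;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right]
```

The key observation making a *residual* method possible is that a maximum-entropy prior policy encodes its own reward: it satisfies $\alpha' \log \pi_{\text{prior}}(a \mid s) = Q_{\text{prior}}(s,a) - V_{\text{prior}}(s)$, so the unknown $r$ can be eliminated in favor of $\log \pi_{\text{prior}}$, and only a residual value term for $r_R$ needs to be learned.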
📝 Abstract
Reinforcement learning and imitation learning have achieved widespread success in many domains, yet deployed policies often face additional requirements that were not considered during training. To address this challenge, policy customization has been introduced: adapting a prior policy so that it meets new task-specific requirements while preserving its inherent properties. A principled approach to policy customization is Residual Q-Learning (RQL), which formulates the problem as a Markov Decision Process (MDP) and derives a family of value-based learning algorithms. However, RQL has not yet been extended to policy gradient methods, which restricts its applicability, especially in tasks where policy gradient methods have already proven more effective. In this work, we first derive a concise form of the Soft Policy Gradient as a preliminary. Building on it, we introduce Residual Policy Gradient (RPG), which extends RQL to policy gradient methods and enables policy customization in gradient-based RL settings. Through the lens of RPG, we revisit the KL-regularized objective widely used in RL fine-tuning and show that, under certain assumptions, it yields a maximum-entropy policy that balances inherent properties and task-specific requirements at the reward level. Our experiments in MuJoCo demonstrate the effectiveness of Soft Policy Gradient and Residual Policy Gradient.
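The reward-level reading of the KL-regularized objective mentioned above follows from a standard identity (shown here per state for brevity; $\beta$ is the regularization weight, our notation):

```latex
\mathbb{E}_{a \sim \pi}\!\left[r(s,a)\right]
- \beta\, \mathrm{KL}\big(\pi(\cdot \mid s)\,\|\,\pi_{\text{prior}}(\cdot \mid s)\big)
\;=\;
\mathbb{E}_{a \sim \pi}\!\left[\, r(s,a) + \beta \log \pi_{\text{prior}}(a \mid s)\,\right]
+ \beta\, \mathcal{H}\big(\pi(\cdot \mid s)\big)
```

since $\mathrm{KL}(\pi \| \pi_{\text{prior}}) = \mathbb{E}_{\pi}[\log \pi - \log \pi_{\text{prior}}] = -\mathcal{H}(\pi) - \mathbb{E}_{\pi}[\log \pi_{\text{prior}}]$. Maximizing the KL-regularized objective is thus maximum-entropy RL on the shaped reward $r + \beta \log \pi_{\text{prior}}$, which adds the new task reward and the prior policy's preferences directly at the reward level.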