🤖 AI Summary
In robot imitation learning, pre-trained policies often fail to adapt to new user preferences without degrading original task performance. This paper proposes a novel human-preference fine-tuning framework for diffusion-based policies. First, a differentiable reward function is learned from pairwise preference comparisons; then, the diffusion policy is efficiently fine-tuned via KL-regularized reinforcement learning (PPO or SAC), where the KL divergence constrains updates to preserve task performance. This work pioneers the integration of preference learning with diffusion policy adaptation and introduces task-preserving KL regularization to jointly achieve fine-grained behavioral alignment and policy stability. Evaluated across diverse robotic manipulation tasks, the method maintains task success rates above 92%, improves preference alignment accuracy by 37%, and significantly mitigates overfitting to preference data.
📝 Abstract
Imitation learning from human demonstrations enables robots to perform complex manipulation tasks and has recently witnessed huge success. However, these techniques often struggle to adapt behavior to new preferences or changes in the environment. To address these limitations, we propose Fine-tuning Diffusion Policy with Human Preference (FDPP). FDPP learns a reward function through preference-based learning. This reward is then used to fine-tune the pre-trained policy with reinforcement learning (RL), resulting in alignment of pre-trained policy with new human preferences while still solving the original task. Our experiments across various robotic tasks and preferences demonstrate that FDPP effectively customizes policy behavior without compromising performance. Additionally, we show that incorporating Kullback-Leibler (KL) regularization during fine-tuning prevents over-fitting and helps maintain the competencies of the initial policy.