🤖 AI Summary
Deep reinforcement learning (DRL) policies often achieve optimal task performance but struggle to accommodate user-specific preferences; explicitly modeling user reward functions is challenging, and retraining from scratch is computationally expensive. To address this, we propose a reward-free dynamic policy fusion framework that leverages trajectory-level user feedback. Our method combines inverse reinforcement learning (IRL) with differentiable policy interpolation to learn, end to end, adaptive fusion weights between a pre-trained task policy and the inferred user intent. Crucially, this enables online, gradient-based optimization of the trade-off between task objectives and personal preferences, a capability not previously realized. Experiments across multiple RL domains demonstrate that our approach maintains original task performance with less than 5% degradation while significantly improving satisfaction of user preferences, and it consistently outperforms both static policy fusion and full-policy retraining baselines in robustness and effectiveness.
📝 Abstract
Reward-optimal policies obtained by training deep reinforcement learning agents may not be aligned with a user’s personal preferences. Rectifying this by retraining the agent with a user-specific reward function is impractical: such functions are not readily available, and retraining incurs high computational cost. Instead, we propose to adapt the already trained policy to the user’s intent via policy fusion, where the intent is in turn inferred from trajectory-level feedback. We design the policy fusion process to be dynamic, so that the resulting policy is dominated neither by the task goals nor by the user’s needs. We empirically demonstrate that our method consistently balances these objectives across various environments.
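The fusion idea described above can be sketched as a weighted combination of two action distributions, with a weight that can vary per state. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the logit-space interpolation, and the fixed toy weights are all assumptions for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of action logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_policies(task_logits, intent_logits, lam):
    """Blend the pre-trained task policy with the inferred user-intent policy.

    lam in [0, 1] is a (possibly state-dependent) fusion weight:
    lam = 0 recovers the task policy, lam = 1 the intent policy.
    In a dynamic scheme, lam would be produced per state by a learned module.
    """
    return softmax((1.0 - lam) * task_logits + lam * intent_logits)

# Toy example with 3 discrete actions (values are illustrative):
task_logits = np.array([2.0, 0.5, -1.0])    # task policy favours action 0
intent_logits = np.array([-1.0, 0.0, 2.0])  # user intent favours action 2

p_task = fuse_policies(task_logits, intent_logits, lam=0.0)
p_mixed = fuse_policies(task_logits, intent_logits, lam=0.5)
p_user = fuse_policies(task_logits, intent_logits, lam=1.0)
```

With a small `lam` the fused policy stays close to the task objective; as `lam` grows, user preferences increasingly shape action selection, which is the trade-off the dynamic weighting is meant to balance.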