🤖 AI Summary
Deep reinforcement learning (DRL) policies often achieve optimal task performance but struggle to accommodate user-specific preferences; explicitly modeling user reward functions is challenging, and retraining from scratch is computationally expensive. To address this, we propose a reward-free dynamic policy fusion framework that leverages trajectory-level user feedback. Our method combines inverse reinforcement learning (IRL) with differentiable policy interpolation to learn, end to end, adaptive fusion weights between a pre-trained task policy and the inferred user intent. Crucially, this enables online, gradient-based optimization of the trade-off between task objectives and personal preferences, a capability not previously realized. Experiments across multiple RL domains demonstrate that our approach maintains original task performance with less than 5% degradation while significantly improving satisfaction of user preferences, and it consistently outperforms both static policy fusion and full-policy retraining baselines in robustness and effectiveness.
📝 Abstract
Reward-optimal policies obtained by training deep reinforcement learning agents may not be aligned with a user’s personal preferences. Rectifying this by retraining the agent with a user-specific reward function is impractical: such functions are not readily available, and retraining incurs high computational cost. Instead, we propose to adapt the already trained policy to the user’s intent via policy fusion, where the intent is in turn inferred from trajectory-level feedback. We design the policy fusion process to be dynamic, so that the resulting policy is dominated neither by the task goals nor by the user’s needs. We empirically demonstrate that our method consistently balances these objectives across various environments.
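The fusion idea described above can be sketched as a weighted combination of two action distributions, with a weight that can vary per state. This is a minimal illustrative sketch, not the paper's exact formulation: the function names, the logit-space interpolation, and the fixed toy weights are all assumptions for the example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of action logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_policies(task_logits, intent_logits, lam):
    """Blend the pre-trained task policy with the inferred user-intent policy.

    lam in [0, 1] is a (possibly state-dependent) fusion weight:
    lam = 0 recovers the task policy, lam = 1 the intent policy.
    In a dynamic scheme, lam would be produced per state by a learned module.
    """
    return softmax((1.0 - lam) * task_logits + lam * intent_logits)

# Toy example with 3 discrete actions (values are illustrative):
task_logits = np.array([2.0, 0.5, -1.0])    # task policy favours action 0
intent_logits = np.array([-1.0, 0.0, 2.0])  # user intent favours action 2

p_task = fuse_policies(task_logits, intent_logits, lam=0.0)
p_mixed = fuse_policies(task_logits, intent_logits, lam=0.5)
p_user = fuse_policies(task_logits, intent_logits, lam=1.0)
```

With a small `lam` the fused policy stays close to the task objective; as `lam` grows, user preferences increasingly shape action selection, which is the trade-off the dynamic weighting is meant to balance.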