🤖 AI Summary
This work addresses the challenge of efficiently adapting general-purpose imitation policies to novel task objectives and constraints while maintaining data efficiency and deployment robustness. The authors propose an instruction-conditioned policy optimization framework that integrates imitation learning with reinforcement learning, leveraging natural language task descriptions to automatically generate reward functions. For the first time, this approach combines human feedback on intermediate trajectories with a Eureka-style reward generation mechanism to enable personalized policy refinement. Evaluated on simulated pick-and-place tasks, the method significantly outperforms feedback-free baselines, achieving enhanced robustness with reduced computational overhead and enabling efficient reuse of general policies across diverse task configurations.
📝 Abstract
This paper presents PRISM: an instruction-conditioned refinement method for imitation policies in robotic manipulation. The approach combines Imitation Learning (IL) and Reinforcement Learning (RL) in a seamless pipeline, such that an imitation policy for a broad, generic task, learned from a set of user-guided demonstrations, can be refined through reinforcement to produce new, unseen fine-grained behaviours. The refinement process follows the Eureka paradigm, in which reward functions for RL are iteratively generated from an initial natural-language task description. The presented approach builds on this mechanism to adapt a generic IL policy to new goal configurations and newly introduced constraints, additionally incorporating human feedback corrections on intermediate rollouts, which enables policy reuse and therefore data efficiency. Results for a pick-and-place task in a simulated scenario show that the proposed method outperforms policies trained without human feedback, improving robustness at deployment and reducing computational burden.
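The refinement loop described above can be sketched in miniature. This is an illustrative toy, not the authors' implementation: `generate_reward_fn` stands in for the Eureka-style LLM step (here a trivial keyword scorer), `train_rl` stands in for RL fine-tuning (here a best-of-candidates selection), and policies are represented as simple word lists.

```python
# Hypothetical sketch of a PRISM-style refinement loop.
# All function names and the toy policy/reward representations are
# assumptions for illustration, not the paper's actual API.

def generate_reward_fn(task_description, feedback):
    """Eureka-style step: a reward is derived from the natural-language
    task description plus any accumulated human feedback."""
    prompt = task_description + (" | feedback: " + feedback if feedback else "")
    keywords = set(prompt.lower().split())
    # Toy reward: count how many behaviour tokens match the prompt.
    return lambda rollout: sum(1 for word in rollout if word in keywords)

def train_rl(policy, reward_fn, candidates):
    """Stand-in for RL fine-tuning: keep whichever behaviour (the current
    IL policy or a candidate refinement) maximises the generated reward."""
    return max([policy] + candidates, key=reward_fn)

def refine(il_policy, task_description, feedback_rounds, candidates):
    """Iteratively regenerate the reward, refine the policy, and fold
    human feedback on intermediate rollouts into the next iteration."""
    policy, feedback = il_policy, ""
    for round_feedback in feedback_rounds:
        reward_fn = generate_reward_fn(task_description, feedback)
        policy = train_rl(policy, reward_fn, candidates)
        feedback = round_feedback  # human correction for the next round
    return policy
```

In this toy, a first round steered only by "pick place cube" prefers a plain pick-and-place behaviour, while feedback such as "avoid obstacle" shifts the regenerated reward toward the constrained variant in the next round, mirroring how intermediate human corrections personalise the refined policy.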