🤖 AI Summary
Open-world robotic manipulation faces challenges including ambiguous task specification, difficulty aligning with human intent, and limited adaptability to dynamic environments. This paper proposes the Vision-Language Model (VLM)-driven Iterative Keypoint Reward (IKER) framework, which jointly processes natural-language instructions and RGB-D observations to automatically synthesize spatially grounded Python reward functions, enabling a closed loop from real-world observation through simulation-based policy training to real-robot deployment. IKER combines VLM-based spatial-relation reward generation with an iterative reward-refinement mechanism, incorporating commonsense priors to support precise SE(3) control, multi-step task planning, on-the-fly strategy adjustment, and spontaneous error recovery. Experiments demonstrate notable success rates and environmental robustness on both prehensile and non-prehensile tasks, validating iterative reward shaping on physical robots.
📝 Abstract
Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed in the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
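To make the idea concrete, here is a minimal sketch of what a VLM-generated, keypoint-conditioned Python reward might look like. This is an illustration under assumptions, not code from the paper: the function name, the keypoint indices, and the specific distance/height terms are all hypothetical, chosen to show how a reward can be expressed purely in terms of spatial relationships between sampled keypoints.

```python
import math

# A keypoint is a 3D position (x, y, z) in meters, e.g. from RGB-D sampling.
Keypoint = tuple[float, float, float]

def reward(keypoints: list[Keypoint]) -> float:
    """Illustrative keypoint-based reward (hypothetical, not IKER's actual output).

    Encourages keypoint 0 (say, a point on the manipulated object) to move
    toward keypoint 1 (a target location), while penalizing keypoint 2
    (another object point) for dropping below the table plane at z = 0.
    """
    # Distance term: smaller object-to-target distance -> higher (less negative) reward.
    r_dist = -math.dist(keypoints[0], keypoints[1])
    # Constraint term: penalty proportional to how far keypoint 2 sinks below z = 0.
    r_height = -max(0.0, -keypoints[2][2])
    return r_dist + r_height
```

A reward of this shape is dense and cheap to evaluate inside a simulator, which is what makes it usable as an RL training signal in the real-to-sim-to-real loop; refining the task then amounts to the VLM rewriting this small function rather than retraining anything from scratch.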