A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

📅 2025-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Open-world robotic manipulation faces challenges including ambiguous task specification, difficulty aligning with human intent, and limited adaptability to dynamic environments. This paper proposes the Vision-Language Model–driven Iterative Keypoint-based Reward (IKER) framework, which jointly models natural language instructions and RGB-D observations to automatically synthesize spatially grounded Python reward functions, enabling a closed loop from real-world observation to simulation-based policy training and real-robot deployment. IKER introduces VLM-based spatial-relation reward generation with an iterative refinement mechanism, incorporating commonsense priors to support precise SE(3) control, multi-step task planning, online policy adaptation, and autonomous error recovery. Experiments demonstrate significant improvements in success rate and environmental robustness on both prehensile and non-prehensile tasks, providing empirical validation of iterative reward shaping on physical robots.

📝 Abstract
Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
Problem

Research questions and friction points this paper is trying to address.

How can task specifications for open-world robotic manipulation remain flexible, adaptive, and aligned with human intent?
Can VLMs automatically generate and iteratively refine keypoint-based reward functions from observations and language?
How can robots execute multi-step manipulation tasks robustly in dynamic environments?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Python-based Iterative Keypoint Reward function
Utilizes Vision-Language Models for reward generation
Real-to-sim-to-real reinforcement learning training loop
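To make the core idea concrete, here is a minimal sketch of what a VLM-generated, keypoint-conditioned Python reward might look like. The specific task (bringing a gripper keypoint to a handle keypoint while keeping an object lifted), the keypoint indices, and the height threshold are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

def reward(keypoints: np.ndarray) -> float:
    """Illustrative IKER-style reward over spatial relations between keypoints.

    keypoints: (N, 3) array of 3D keypoint positions sampled from an
    RGB-D observation. Hypothetical convention for this sketch:
      keypoints[0] = gripper, keypoints[1] = handle, keypoints[2] = object top.
    """
    # Dense reaching term: negative distance between gripper and handle.
    grasp_dist = np.linalg.norm(keypoints[0] - keypoints[1])
    reach_reward = -grasp_dist

    # Commonsense prior (hypothetical): bonus if the object stays lifted
    # above a 0.1 m height threshold.
    height_bonus = 1.0 if keypoints[2][2] > 0.1 else 0.0

    return float(reach_reward + height_bonus)
```

In the actual framework such a function would be synthesized by the VLM from the instruction and scene keypoints, used to train an RL policy in the reconstructed simulation, and then iteratively refined from feedback.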
👥 Authors
Shivansh Patel, University of Illinois at Urbana-Champaign
Xi Yin, Research Scientist, Facebook (Computer Vision, Machine Learning, Deep Learning)
Wenlong Huang, Stanford University (Robotics, Machine Learning, Foundation Models)
Shubham Garg, Amazon
H. Nayyeri, Amazon
Fei-Fei Li, Stanford University
S. Lazebnik, University of Illinois at Urbana-Champaign
Yunzhu Li, Columbia University (Robotics, Computer Vision, Machine Learning)