Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation

πŸ“… 2025-01-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the zero-shot generalization challenge across robots, tasks, and environments in embodied manipulation, this paper proposes RoboPro (Robotic Programmer), a robotic foundation model that perceives visual information and follows free-form instructions to perform manipulation by generating executable policy code. Because collecting runtime code data for robotic tasks is inefficient and costly, the authors devise Video2Code, a pipeline that synthesizes executable policy code from in-the-wild videos using an off-the-shelf vision-language model and a code-domain large language model. The generated code is grounded in an atomic skill library, which lets the model generalize across API formats and skill sets. Under zero-shot evaluation on RLBench, RoboPro surpasses GPT-4o by an 11.6% success rate and is comparable to a strong supervised baseline, establishing state-of-the-art zero-shot performance in both simulation and real-robot deployment.
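The two-stage Video2Code idea summarized above can be sketched with mocked models. This is only an illustration of the data flow (video → VLM step plan → code LLM → policy code); the function names, prompts, and skill API below are assumptions, not RoboPro's actual interfaces.

```python
# Stage 1 (mock): a vision-language model turns a demonstration video
# plus an instruction into a textual step plan.
def vlm_describe(video_path: str, instruction: str) -> str:
    return "1. locate the cup  2. grasp it  3. move above the shelf  4. release"

# Stage 2 (mock): a code-domain LLM converts the step plan into policy
# code that calls a hypothetical atomic skill library.
def code_llm(plan: str) -> str:
    return (
        "pose = detect('cup')\n"
        "move_to(pose)\n"
        "grasp()\n"
        "move_to(detect('shelf'))\n"
        "release()"
    )

def video2code(video_path: str, instruction: str) -> str:
    """Chain the two models: video + instruction -> executable policy code."""
    plan = vlm_describe(video_path, instruction)
    return code_llm(plan)

code = video2code("demo.mp4", "put the cup on the shelf")
print(code.splitlines()[0])  # → pose = detect('cup')
```

The point of the split is that the VLM only needs to understand the video, while the code LLM only needs to translate a plan into API calls, so neither model requires robot-specific training data.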

πŸ“ Abstract
Zero-shot generalization across various robots, tasks, and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation via policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we devise Video2Code, which synthesizes executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses the state-of-the-art model GPT-4o by 11.6%, which is even comparable to a strong supervised training baseline. Furthermore, RoboPro is robust to variations in API formats and skill sets.
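To make "policy code connecting high-level task descriptions to low-level action sequences" concrete, here is a minimal sketch of what LLM-generated policy code over an atomic skill library might look like. The primitives (`detect`, `move_to`, `grasp`, `release`) and the task are illustrative assumptions, not RoboPro's real skill API; the mocks merely log calls so the control flow is visible.

```python
call_log = []  # records which skills the policy invokes, in order

# --- Mock atomic skill library (names are hypothetical) ---
def detect(obj):
    """Perception primitive: return a (fake) 3D pose for the named object."""
    call_log.append(("detect", obj))
    return (0.1, 0.2, 0.3)  # placeholder x, y, z

def move_to(pose):
    call_log.append(("move_to", pose))

def grasp():
    call_log.append(("grasp",))

def release():
    call_log.append(("release",))

# --- Policy code an LLM might emit for "put the apple in the drawer" ---
def put_apple_in_drawer():
    apple_pose = detect("apple")
    move_to(apple_pose)
    grasp()
    drawer_pose = detect("drawer")
    move_to(drawer_pose)
    release()

put_apple_in_drawer()
```

Because the skill library hides robot-specific control, the same generated code can in principle run on any platform that implements these primitives, which is the source of the cross-robot generalization the abstract claims.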
Problem

Research questions and friction points this paper is trying to address.

Robotics
Adaptability
Unseen Environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy Code Generation
RoboPro Foundational Model
Video2Code Mechanism
πŸ”Ž Similar Papers
No similar papers found.
Senwei Xie
Institute of Computing Technology, Chinese Academy of Sciences
Embodied AI
Hongyu Wang
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100190, China, and University of Chinese Academy of Sciences, Beijing 100049, China
Zhanqi Xiao
Institute of Computing Technology, Chinese Academy of Sciences
Embodied AI, Robotics, Robot Learning
Ruiping Wang
Professor, Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision, Pattern Recognition, Machine Learning
Xilin Chen
Key Laboratory of AI Safety of CAS, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, 100190, China, and University of Chinese Academy of Sciences, Beijing 100049, China