🤖 AI Summary
To address the zero-shot generalization challenge across robots, tasks, and environments in embodied manipulation, this paper proposes Robotic Programmer (RoboPro), a robotic foundation model that perceives visual input and follows free-form instructions to generate executable policy code for manipulation in a zero-shot manner. Our key contributions are: (1) RoboPro, a robotic foundation model that unifies visual perception and instruction-driven control through policy code generation, and remains robust to variations in API formats and skill libraries; (2) Video2Code, a low-cost data synthesis pipeline that produces executable runtime code from extensive in-the-wild videos by combining an off-the-shelf vision-language model with a code-domain large language model; and (3) extensive evaluation in simulation and on real robots. Under zero-shot evaluation on RLBench, RoboPro achieves an 11.6% higher success rate than GPT-4o and is comparable to a strong supervised baseline, establishing new state-of-the-art zero-shot performance in both simulation and real-robot deployment.
📄 Abstract
Zero-shot generalization across robots, tasks, and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions to low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation via policy code in a zero-shot manner. To address the low efficiency and high cost of collecting runtime code data for robotic tasks, we devise Video2Code, which synthesizes executable code from extensive in-the-wild videos using an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses that of the state-of-the-art model GPT-4o by 11.6%, which is even comparable to a strong supervised training baseline. Furthermore, RoboPro is robust to variations in API formats and skill sets.
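To make the "policy code over an atomic skill library" idea concrete, the sketch below shows what LLM-generated policy code for a pick-and-place instruction might look like. This is a minimal illustration, not RoboPro's actual API: the skill names (`detect`, `move_to`, `grasp`, `release`) and the toy `Robot` class are assumptions for exposition; in the paper's setting, such skills would wrap real perception and control modules.

```python
# Hypothetical sketch of "policy code" composed from an atomic skill
# library. All skill names and signatures here are illustrative
# assumptions, not the paper's actual interface.

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Robot:
    """Toy robot exposing a few atomic skills; records each call."""
    log: List[str] = field(default_factory=list)

    def detect(self, obj: str) -> Tuple[float, float, float]:
        # A real skill would query a perception module; we return a stub pose.
        self.log.append(f"detect({obj})")
        return (0.3, 0.1, 0.05)

    def move_to(self, pose: Tuple[float, float, float]) -> None:
        self.log.append(f"move_to{pose}")

    def grasp(self) -> None:
        self.log.append("grasp()")

    def release(self) -> None:
        self.log.append("release()")


def pick_and_place(robot: Robot, obj: str, target: str) -> None:
    """What generated policy code could look like for the instruction
    'put the <obj> on the <target>': the high-level task is grounded
    into a sequence of atomic skill calls."""
    obj_pose = robot.detect(obj)
    robot.move_to(obj_pose)
    robot.grasp()
    target_pose = robot.detect(target)
    robot.move_to(target_pose)
    robot.release()


robot = Robot()
pick_and_place(robot, "red block", "plate")
print(robot.log)
```

Because the connective tissue is ordinary code, generalization to a new robot or skill set reduces to exposing a different skill library to the code generator, which is why robustness to API-format and skill-set variations matters in this setting.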