IGen: Scalable Data Generation for Robot Learning from Open-World Images

πŸ“… 2025-12-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the scarcity of real-world embodied interaction data in robot learning, this paper introduces the first end-to-end framework for synthesizing high-fidelity visual–action data directly from unlabeled open-world images. Methodologically, it first reconstructs 2D images into structured 3D scenes; it then leverages vision-language models to interpret task instructions, jointly performing high-level planning and low-level motion optimization to generate semantically coherent, kinematically feasible SE(3) end-effector pose sequences; finally, it produces temporally coherent dynamic visual observations via differentiable rendering. The core contribution is a closed-loop synthesis pipeline bridging "open-world images → 3D scene → task-driven actions → embodied video." Experiments demonstrate that policies trained solely on the synthetic data perform on par with real-data baselines across diverse manipulation tasks and exhibit significantly improved cross-task generalization, validating both the fidelity and the training utility of the synthesized data.
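
To make the four-stage structure concrete, here is a minimal Python sketch of the pipeline as described above. Every type and function name (`Scene`, `reconstruct_scene`, `vlm_plan`, `optimize_ee_poses`, `render_observations`) is a hypothetical placeholder standing in for a component the paper does not expose; this is a sketch of the stage interfaces, not IGen's actual API.

```python
"""Minimal structural sketch of the IGen stages, under assumed interfaces."""
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Scene:
    """Structured 3D scene lifted from a single 2D image (placeholder)."""
    objects: list

def reconstruct_scene(image: np.ndarray) -> Scene:
    # Placeholder for image-to-3D reconstruction.
    return Scene(objects=["mug"])

def vlm_plan(scene: Scene, instruction: str) -> List[str]:
    # Placeholder for VLM-based high-level planning.
    return ["grasp", "lift", "place"]

def optimize_ee_poses(scene: Scene, plan: List[str]) -> List[np.ndarray]:
    # Placeholder for low-level motion optimization; each pose is a
    # 4x4 homogeneous transform in SE(3).
    return [np.eye(4) for _ in plan]

def render_observations(scene: Scene, poses: List[np.ndarray]) -> List[np.ndarray]:
    # Placeholder for rendering the dynamic scene evolution.
    return [np.zeros((64, 64, 3)) for _ in poses]

def generate_visuomotor_data(image: np.ndarray, instruction: str):
    scene = reconstruct_scene(image)        # 2D pixels -> structured 3D scene
    plan = vlm_plan(scene, instruction)     # instruction -> high-level plan
    poses = optimize_ee_poses(scene, plan)  # plan -> SE(3) pose sequence
    obs = render_observations(scene, poses) # poses -> visual observations
    return list(zip(obs, poses))            # paired (visual, action) data
```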

πŸ“ Abstract
The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training.
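
The actions described in the abstract are SE(3) end-effector pose sequences, i.e., sequences of 4×4 homogeneous transforms that combine a rotation and a translation. The snippet below is one self-contained way to build such a sequence by interpolating between two keyframe poses with SciPy (slerp on the rotation, linear interpolation on the translation); the keyframe values are invented for illustration and nothing here reflects the paper's implementation.

```python
# Illustrative SE(3) pose sequence from two invented keyframe poses.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# Keyframe rotations (quaternions, xyzw order) and translations (meters).
key_rots = Rotation.from_quat([[0.0, 0.0, 0.0, 1.0],        # identity
                               [0.0, 0.0, 0.7071, 0.7071]])  # 90 deg about z
key_trans = np.array([[0.4, 0.0, 0.3],
                      [0.4, 0.2, 0.1]])
key_times = [0.0, 1.0]

slerp = Slerp(key_times, key_rots)   # spherical interpolation on SO(3)
times = np.linspace(0.0, 1.0, 5)

poses = []
for t in times:
    T = np.eye(4)
    T[:3, :3] = slerp([t]).as_matrix()[0]                  # rotation block
    T[:3, 3] = (1 - t) * key_trans[0] + t * key_trans[1]   # translation
    poses.append(T)                                        # 4x4 in SE(3)

print(poses[2])  # midpoint end-effector pose
```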
Problem

Research questions and friction points this paper is trying to address.

How to obtain robot training data from open-world images that lack action labels
How to convert unstructured 2D images into 3D scene representations suitable for manipulation
How to synthesize executable actions and matching visual observations for policy training
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end generation of executable robot actions from action-free open-world images
Structured 3D scene representations combined with vision-language-model task planning
Joint synthesis of SE(3) end-effector pose sequences and temporally coherent visual observations
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Chenghao Gu, Tsinghua University
Haolan Kang, HKU
Junchao Lin, Beijing University of Chemical Technology
Jinghe Wang, Tsinghua University
Duo Wu, Tsinghua University
Shuzhao Xie, Tsinghua University. Interests: Graphics, Multimedia
Fanding Huang, Tsinghua University. Interests: Semantic Segmentation, Test-time Adaptation, Large Language Models
Junchen Ge, Tsinghua University
Ziyang Gong, SJTU, THU, Shanghai AI Lab (OpenGVLab), SYSU. Interests: Embodied Spatial Intelligence
Letian Li, Tsinghua University
Hongying Zheng, Shenzhen University of Information Technology
Changwei Lv, Shenzhen University of Information Technology
Zhi Wang, Tsinghua University