IGen: Scalable Data Generation for Robot Learning from Open-World Images

πŸ“… 2025-12-01
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the scarcity of real-world embodied interaction data in robot learning, this paper introduces the first end-to-end framework for synthesizing high-fidelity visual–action data directly from unlabeled open-world images. Methodologically, it first reconstructs 2D images into structured 3D scenes; it then leverages vision-language models to interpret task instructions, jointly performing high-level planning and low-level motion optimization to generate semantically coherent, kinematically feasible SE(3) end-effector pose sequences; finally, it produces temporally coherent dynamic visual observations via differentiable rendering. The core contribution is a closed-loop synthesis pipeline bridging "open-world images → 3D scene → task-driven actions → embodied video." Experiments demonstrate that policies trained solely on the synthetic data perform on par with real-data baselines across diverse manipulation tasks and exhibit significantly improved cross-task generalization, validating both the fidelity and the training utility of the synthesized data.
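
To make the four-stage structure concrete, here is a minimal Python sketch of the pipeline as described above. Every type and function name (`Scene`, `reconstruct_scene`, `vlm_plan`, `optimize_ee_poses`, `render_observations`) is a hypothetical placeholder standing in for a component the paper does not expose; this is a sketch of the stage interfaces, not IGen's actual API.

```python
"""Minimal structural sketch of the IGen stages, under assumed interfaces."""
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Scene:
    """Structured 3D scene lifted from a single 2D image (placeholder)."""
    objects: list

def reconstruct_scene(image: np.ndarray) -> Scene:
    # Placeholder for image-to-3D reconstruction.
    return Scene(objects=["mug"])

def vlm_plan(scene: Scene, instruction: str) -> List[str]:
    # Placeholder for VLM-based high-level planning.
    return ["grasp", "lift", "place"]

def optimize_ee_poses(scene: Scene, plan: List[str]) -> List[np.ndarray]:
    # Placeholder for low-level motion optimization; each pose is a
    # 4x4 homogeneous transform in SE(3).
    return [np.eye(4) for _ in plan]

def render_observations(scene: Scene, poses: List[np.ndarray]) -> List[np.ndarray]:
    # Placeholder for rendering the dynamic scene evolution.
    return [np.zeros((64, 64, 3)) for _ in poses]

def generate_visuomotor_data(image: np.ndarray, instruction: str):
    scene = reconstruct_scene(image)        # 2D pixels -> structured 3D scene
    plan = vlm_plan(scene, instruction)     # instruction -> high-level plan
    poses = optimize_ee_poses(scene, plan)  # plan -> SE(3) pose sequence
    obs = render_observations(scene, poses) # poses -> visual observations
    return list(zip(obs, poses))            # paired (visual, action) data
```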

πŸ“ Abstract
The rise of generalist robotic policies has created an exponential demand for large-scale training data. However, on-robot data collection is labor-intensive and often limited to specific environments. In contrast, open-world images capture a vast diversity of real-world scenes that naturally align with robotic manipulation tasks, offering a promising avenue for low-cost, large-scale robot data acquisition. Despite this potential, the lack of associated robot actions hinders the practical use of open-world images for robot learning, leaving this rich visual resource largely unexploited. To bridge this gap, we propose IGen, a framework that scalably generates realistic visual observations and executable actions from open-world images. IGen first converts unstructured 2D pixels into structured 3D scene representations suitable for scene understanding and manipulation. It then leverages the reasoning capabilities of vision-language models to transform scene-specific task instructions into high-level plans and generate low-level actions as SE(3) end-effector pose sequences. From these poses, it synthesizes dynamic scene evolution and renders temporally coherent visual observations. Experiments validate the high quality of visuomotor data generated by IGen, and show that policies trained solely on IGen-synthesized data achieve performance comparable to those trained on real-world data. This highlights the potential of IGen to support scalable data generation from open-world images for generalist robotic policy training.
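
The actions described in the abstract are SE(3) end-effector pose sequences, i.e., sequences of 4×4 homogeneous transforms that combine a rotation and a translation. The snippet below is one self-contained way to build such a sequence by interpolating between two keyframe poses with SciPy (slerp on the rotation, linear interpolation on the translation); the keyframe values are invented for illustration and nothing here reflects the paper's implementation.

```python
# Illustrative SE(3) pose sequence from two invented keyframe poses.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

# Keyframe rotations (quaternions, xyzw order) and translations (meters).
key_rots = Rotation.from_quat([[0.0, 0.0, 0.0, 1.0],        # identity
                               [0.0, 0.0, 0.7071, 0.7071]])  # 90 deg about z
key_trans = np.array([[0.4, 0.0, 0.3],
                      [0.4, 0.2, 0.1]])
key_times = [0.0, 1.0]

slerp = Slerp(key_times, key_rots)   # spherical interpolation on SO(3)
times = np.linspace(0.0, 1.0, 5)

poses = []
for t in times:
    T = np.eye(4)
    T[:3, :3] = slerp([t]).as_matrix()[0]                  # rotation block
    T[:3, 3] = (1 - t) * key_trans[0] + t * key_trans[1]   # translation
    poses.append(T)                                        # 4x4 in SE(3)

print(poses[2])  # midpoint end-effector pose
```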
Problem

Research questions and friction points this paper is trying to address.

How to obtain robot training data from open-world images that lack action labels
How to convert unstructured 2D images into 3D scene representations suitable for manipulation
How to synthesize executable actions and matching visual observations for policy training
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end generation of executable robot actions from action-free open-world images
Structured 3D scene representations combined with vision-language-model task planning
Joint synthesis of SE(3) end-effector pose sequences and temporally coherent visual observations
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Chenghao Gu, Tsinghua University
Haolan Kang, HKU
Junchao Lin, Beijing University of Chemical Technology
Jinghe Wang, Tsinghua University
Duo Wu, Tsinghua University
Shuzhao Xie, Tsinghua University. Interests: Graphics, Multimedia
Fanding Huang, Tsinghua University. Interests: Semantic Segmentation, Test-time Adaptation, Large Language Models
Junchen Ge, Tsinghua University
Ziyang Gong, SJTU, THU, Shanghai AI Lab (OpenGVLab), SYSU. Interests: Embodied Spatial Intelligence
Letian Li, Tsinghua University
Hongying Zheng, Shenzhen University of Information Technology
Changwei Lv, Shenzhen University of Information Technology
Zhi Wang, Tsinghua University