🤖 AI Summary
This work investigates whether purely synthetic data can substitute for real-robot data in pretraining vision-language-action (VLA) models. To this end, we design a high-fidelity, highly automated simulation pipeline for multimodal, multi-skill, long-horizon embodied tasks, which enables large-scale, fully decoupled, composable, and annotation-free data generation for embodied intelligence. Using data from this pipeline, we pretrain a model end to end with the same architecture as π₀. Experiments show that the model matches π₀'s performance across 49 simulated tasks, 5 real-world tasks, and 4 dexterous long-horizon tasks, and that it also achieves zero-shot sim-to-real transfer on several challenging tasks. This is the first empirical evidence that high-quality synthetic data alone can suffice for pretraining a general-purpose VLA policy.
📝 Abstract
Recent work explores how real and synthetic data contribute to the generalization of Vision-Language-Action (VLA) models. While current VLA models have demonstrated the effectiveness of large-scale real-robot pre-training, synthetic data has not previously shown comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $\pi$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprising zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours of data across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering the manipulation of rigid, articulated, deformable, and fluid objects. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables long-horizon skill composition, flexible task assembly, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $\pi_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $\pi_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We release the dataset and will open-source the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.