InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

📅 2025-11-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether purely synthetic data can substitute for real-robot data when pretraining vision-language-action (VLA) models. To this end, we design a high-fidelity, fully automated simulation pipeline that supports multimodal, multi-skill, long-horizon embodied tasks and enables large-scale, fully decoupled, composable, annotation-free data generation. Using data from this pipeline, we pretrain a model end to end with the same architecture as π₀. Experiments show that the model matches π₀'s performance across 49 simulated tasks, 5 real-world tasks, and 4 dexterous long-horizon tasks, and that it transfers zero-shot from simulation to the real world on several challenging tasks. This is the first evidence that high-quality synthetic data alone can be sufficient for pretraining a general-purpose VLA policy.

📝 Abstract

Recent works explore how real and synthetic data contribute to Vision-Language-Action (VLA) models' generalization. While current VLA models have shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $\pi$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprising zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables long-horizon skill composition, flexible task assembly, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $\pi_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $\pi_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We release the dataset and will open-source the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.
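To give a feel for the scale quoted in the abstract, the short Python sketch below derives two rough averages from the stated totals. The function and constant names are illustrative and do not come from the paper. Because the abstract says "over 630k" trajectories, the mean duration computed here is an upper estimate, and the per-task figure is only an average: the abstract does not say that trajectories are spread evenly across tasks.

```python
# Back-of-the-envelope averages from the totals quoted in the abstract.
# All names here are illustrative; they are not part of the dataset or its tooling.

TRAJECTORIES = 630_000  # "over 630k", so this is a lower bound on the true count
HOURS = 7_433           # total duration of the dataset
TASKS = 70              # number of tasks


def mean_trajectory_seconds(hours: float, trajectories: int) -> float:
    """Average trajectory length in seconds."""
    return hours * 3600 / trajectories


def mean_trajectories_per_task(trajectories: int, tasks: int) -> float:
    """Average number of trajectories per task (actual coverage may be uneven)."""
    return trajectories / tasks


if __name__ == "__main__":
    secs = mean_trajectory_seconds(HOURS, TRAJECTORIES)
    per_task = mean_trajectories_per_task(TRAJECTORIES, TASKS)
    print(f"Mean trajectory length: about {secs:.1f} s (upper estimate)")
    print(f"Mean trajectories per task: about {per_task:.0f}")
```

Under these figures a trajectory lasts roughly 42 seconds on average, with about 9,000 trajectories per task.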

Problem

Research questions and friction points this paper is trying to address.

Demonstrating that synthetic data can match real-robot data for pre-training VLA models

Creating a scalable simulation pipeline for diverse robotic manipulation tasks and embodiments

Enabling zero-shot sim-to-real transfer for challenging robotic manipulation tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data matches real-data performance in pre-training

Autonomous simulation pipeline enables scalable trajectory generation

Zero-shot sim-to-real transfer achieved on several challenging manipulation tasks
