🤖 AI Summary
Existing synthetic data for dual-arm robotic manipulation generalizes poorly, primarily because (1) there is no efficient, scalable method for generating data for novel tasks, and (2) simulation environments are oversimplified and fail to capture real-world complexity. This paper introduces RoboTwin 2.0, a scalable framework that addresses both limitations. It constructs RoboTwin-OD, a large-scale real-world object library; designs a code-generation pipeline driven by multimodal large language models; and applies structured domain randomization along five axes to enhance cross-scenario generalization. RoboTwin 2.0 covers 50 dual-arm manipulation tasks across five robot embodiments and provides more than 100,000 expert demonstration trajectories. Evaluations show that a fine-tuned model achieves a 367% relative improvement on unseen real-world scenes, zero-shot models trained solely on the synthetic data gain 228%, and code-generation success rises by 10.9%.
📝 Abstract
Simulation-based data synthesis has emerged as a powerful paradigm for enhancing real-world robotic manipulation. However, existing synthetic datasets remain insufficient for robust bimanual manipulation due to two challenges: (1) the lack of an efficient, scalable data generation method for novel tasks, and (2) oversimplified simulation environments that fail to capture real-world complexity. We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data, along with unified evaluation protocols for dual-arm manipulation. We first construct RoboTwin-OD, a large-scale object library comprising 731 instances across 147 categories, each annotated with semantic and manipulation-relevant labels. Building on this foundation, we develop an expert data synthesis pipeline that combines multimodal large language models (MLLMs) with simulation-in-the-loop refinement to generate task-level execution code automatically. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language instructions, thereby enhancing data diversity and policy robustness. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories. Empirical results show a 10.9% gain in code generation success and improved generalization to novel real-world scenarios. A VLA model fine-tuned on our dataset achieves a 367% relative improvement (42.0% vs. 9.0%) on real-world tasks in unseen scenes, while zero-shot models trained solely on our synthetic data achieve a 228% relative gain, highlighting strong generalization without real-world supervision. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation.
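To make the five-axis randomization concrete, here is a minimal sketch of how one might sample a scene configuration over the axes the abstract names (clutter, lighting, background, tabletop height, and language instructions). All value ranges, choices, and the function name are hypothetical illustrations; the paper does not specify RoboTwin 2.0's actual parameters or API.

```python
import random

def sample_domain_randomization(rng: random.Random) -> dict:
    """Sample one randomized scene configuration across the five axes
    described in the abstract. Ranges and options below are assumptions
    for illustration, not values from the paper."""
    return {
        # Axis 1: clutter -- number of distractor objects on the table
        "clutter_objects": rng.randint(0, 8),
        # Axis 2: lighting -- brightness and color temperature
        "lighting": {
            "intensity": rng.uniform(0.3, 1.5),       # relative brightness
            "color_temp_k": rng.uniform(3000, 6500),  # warm to cool light
        },
        # Axis 3: background -- tabletop/scene texture
        "background": rng.choice(["wood", "marble", "cloth", "plain"]),
        # Axis 4: tabletop height, in meters
        "table_height_m": rng.uniform(0.70, 0.85),
        # Axis 5: language instruction -- paraphrased task descriptions
        "instruction": rng.choice([
            "pick up the mug and hand it to the other arm",
            "grasp the cup with the left gripper, then pass it to the right",
        ]),
    }

# Each expert trajectory would be collected under a fresh sample like this,
# so the dataset covers the product of all five axes.
rng = random.Random(0)
for _ in range(3):
    cfg = sample_domain_randomization(rng)
    print(cfg["background"], round(cfg["table_height_m"], 3))
```

The design point is simply that each axis is sampled independently per episode, so policy training sees broad combinatorial variation rather than a fixed simulated scene.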