🤖 AI Summary
This work addresses the limited data availability and high collection costs that hinder generalization in real-world robotic policy learning. The authors propose a generative framework that automatically transforms real-world panoramic images into high-fidelity, editable “digital twin” simulation environments, and then derives diverse variant scenes through semantic and geometric editing. Combined with a high-quality physics engine, realistic assets, and multi-room stitching, the framework supports interactive manipulation and long-horizon navigation tasks. Experiments show a strong sim-to-real correlation, and scaling up the generation of diverse scenes significantly improves policy generalization to unseen environments and objects.
📝 Abstract
Learning robust robot policies for real-world environments requires diverse training data, yet scaling real-world data collection is costly because it requires acquiring physical assets and reconfiguring environments. Transferring real-world scenes into simulation has therefore become a practical form of augmentation for efficient learning and evaluation. We present a generative framework that establishes a real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes and further synthesizes diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. Experiments demonstrate a strong sim-to-real correlation, validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.