🤖 AI Summary
Embodied AI research is hampered by 3D scene construction that is manual, non-scalable, and poor at generalization. Method: This paper proposes a fully automated paradigm for generating interactive 3D scenes from real-world scans. It introduces MetaScenes, the first large-scale, interactive 3D scene benchmark derived from real scans, containing 15,366 objects across 831 fine-grained categories, together with Scan2Sim, a multimodal alignment model that integrates CLIP and point-cloud encoders to replace scanned objects with simulation-ready assets in a high-fidelity, semantically consistent way. The approach further incorporates joint geometric-semantic modeling, differentiable scene synthesis, and physics-aware optimization. Results: Experiments demonstrate substantial improvements in cross-domain transfer and sim-to-real generalization for robotic manipulation and vision-language navigation tasks, consistently outperforming manually constructed scene baselines across multiple benchmarks.
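The core retrieval step described above, matching a scanned object to a simulation-ready asset via fused image and point-cloud embeddings, can be illustrated with a minimal sketch. This is not the paper's implementation: the embeddings here are placeholder vectors, and the fusion weight `w_img` and the function names are illustrative assumptions; a real pipeline would obtain image features from a CLIP encoder and shape features from a point-cloud encoder.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve_asset(scan_img_emb, scan_pts_emb, asset_img_embs, asset_pts_embs, w_img=0.5):
    """Score each candidate asset against a scanned object.

    Fuses two modalities by a weighted sum of cosine similarities:
    image-space similarity (e.g. CLIP features) and geometric similarity
    (e.g. point-cloud encoder features). Returns the best asset index
    and the full score vector. `w_img` is an assumed fusion weight.
    """
    q_img = l2_normalize(scan_img_emb)
    q_pts = l2_normalize(scan_pts_emb)
    a_img = l2_normalize(asset_img_embs)   # shape (num_assets, d_img)
    a_pts = l2_normalize(asset_pts_embs)   # shape (num_assets, d_pts)
    scores = w_img * (a_img @ q_img) + (1.0 - w_img) * (a_pts @ q_pts)
    return int(np.argmax(scores)), scores

# Toy example: asset 0 matches the scanned object in both modalities.
scan_img = np.array([1.0, 0.0])
scan_pts = np.array([0.0, 1.0])
assets_img = np.array([[1.0, 0.0], [0.0, 1.0]])
assets_pts = np.array([[0.0, 1.0], [1.0, 0.0]])
best, scores = retrieve_asset(scan_img, scan_pts, assets_img, assets_pts)
```

The fused score rewards candidates that agree with the scan both semantically (appearance/category) and geometrically (shape), which is what lets retrieval stay consistent across the two modalities.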
📝 Abstract
Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates precise replication of real-world object diversity. Existing datasets demonstrate that this process relies heavily on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15,366 objects spanning 831 fine-grained categories. We then introduce Scan2Sim, a robust multi-modal alignment model that enables automated, high-quality asset replacement, eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small-item layouts for robotic manipulation, and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScenes' potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: https://meta-scenes.github.io/.