🤖 AI Summary
Embodied AI research is hampered by 3D scene construction that is manual, non-scalable, and poor at generalization. Method: This paper proposes a fully automated paradigm for generating interactive 3D scenes from real-world scans. It introduces MetaScenes, the first large-scale, interactive 3D scene benchmark derived from real scans, containing 15,366 objects across 831 fine-grained categories, together with Scan2Sim, a multimodal alignment model that integrates CLIP and point-cloud encoders to replace scanned objects with simulation-ready assets in a high-fidelity, semantically consistent way. The approach further incorporates joint geometric-semantic modeling, differentiable scene synthesis, and physics-aware optimization. Results: Experiments demonstrate substantial improvements in cross-domain transfer and sim-to-real generalization for robotic manipulation and vision-language navigation tasks, consistently outperforming manually constructed scene baselines across multiple benchmarks.
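The core retrieval step described above, matching a scanned object to a simulation-ready asset via fused image and point-cloud embeddings, can be illustrated with a minimal sketch. This is not the paper's implementation: the embeddings here are placeholder vectors, and the fusion weight `w_img` and the function names are illustrative assumptions; a real pipeline would obtain image features from a CLIP encoder and shape features from a point-cloud encoder.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve_asset(scan_img_emb, scan_pts_emb, asset_img_embs, asset_pts_embs, w_img=0.5):
    """Score each candidate asset against a scanned object.

    Fuses two modalities by a weighted sum of cosine similarities:
    image-space similarity (e.g. CLIP features) and geometric similarity
    (e.g. point-cloud encoder features). Returns the best asset index
    and the full score vector. `w_img` is an assumed fusion weight.
    """
    q_img = l2_normalize(scan_img_emb)
    q_pts = l2_normalize(scan_pts_emb)
    a_img = l2_normalize(asset_img_embs)   # shape (num_assets, d_img)
    a_pts = l2_normalize(asset_pts_embs)   # shape (num_assets, d_pts)
    scores = w_img * (a_img @ q_img) + (1.0 - w_img) * (a_pts @ q_pts)
    return int(np.argmax(scores)), scores

# Toy example: asset 0 matches the scanned object in both modalities.
scan_img = np.array([1.0, 0.0])
scan_pts = np.array([0.0, 1.0])
assets_img = np.array([[1.0, 0.0], [0.0, 1.0]])
assets_pts = np.array([[0.0, 1.0], [1.0, 0.0]])
best, scores = retrieve_asset(scan_img, scan_pts, assets_img, assets_pts)
```

The fused score rewards candidates that agree with the scan both semantically (appearance/category) and geometrically (shape), which is what lets retrieval stay consistent across the two modalities.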
📝 Abstract
Embodied AI (EAI) research requires high-quality, diverse 3D scenes to effectively support skill acquisition, sim-to-real transfer, and generalization. Achieving these quality standards, however, necessitates precise replication of real-world object diversity. Existing datasets demonstrate that this process relies heavily on artist-driven designs, which demand substantial human effort and present significant scalability challenges. To scalably produce realistic and interactive 3D scenes, we first present MetaScenes, a large-scale, simulatable 3D scene dataset constructed from real-world scans, which includes 15,366 objects spanning 831 fine-grained categories. We then introduce Scan2Sim, a robust multi-modal alignment model that enables automated, high-quality asset replacement, eliminating the reliance on artist-driven designs for scaling 3D scenes. We further propose two benchmarks to evaluate MetaScenes: a detailed scene synthesis task focused on small-item layouts for robotic manipulation, and a domain transfer task in vision-and-language navigation (VLN) to validate cross-domain transfer. Results confirm MetaScenes' potential to enhance EAI by supporting more generalizable agent learning and sim-to-real applications, introducing new possibilities for EAI research. Project website: https://meta-scenes.github.io/.