🤖 AI Summary
Real-world 3D scan data is hard to leverage because of challenges in data volume, heterogeneous annotation formats, and poor tool interoperability. To address these bottlenecks, this work proposes a unified annotation integration framework built on Universal Scene Description (USD), defining application-specific USD flavors tailored to embodied intelligence tasks so that multi-source annotations can be fused in a structured way. Large language models (LLMs) are further incorporated for scene semantic parsing and editable scene representations, establishing an end-to-end pipeline: scan → semantic understanding → simulation-based policy learning. Experiments demonstrate an 80% success rate on LLM-driven scene editing tasks and an 87% success rate on robot policy learning in simulation, improving generalization in photorealistic environments. This is presented as the first work to deeply integrate USD into 3D embodied intelligence data infrastructure, introducing a scalable, interoperable paradigm for scan-driven embodied reasoning and action.
📝 Abstract
Real-world 3D scene-level scans offer realism and can enable better real-world generalizability for downstream applications. However, challenges such as data volume, diverse annotation formats, and tool compatibility limit their use. This paper demonstrates a methodology to effectively leverage these scans and their annotations. We propose a unified annotation integration scheme based on USD, with application-specific USD flavors. We identify challenges in utilizing holistic real-world scan datasets and present mitigation strategies. The efficacy of our approach is demonstrated through two downstream applications: LLM-based scene editing, enabling effective LLM understanding and adaptation of the data (80% success rate), and robotic simulation, achieving an 87% success rate in policy learning.
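To make the USD-centric design more concrete, below is a minimal, hypothetical sketch (using the OpenUSD Python bindings, `pxr`) of how multi-source annotations and an application-specific flavor might be composed onto a scanned scene. The prim paths, attribute names, annotation sources, and the use of a variant set to model "flavors" are illustrative assumptions, not the paper's actual schema.

```python
# Minimal, hypothetical sketch of USD-based annotation fusion (not the paper's
# actual schema). Requires the OpenUSD Python bindings: pip install usd-core
from pxr import Usd, Sdf

stage = Usd.Stage.CreateInMemory()
world = stage.DefinePrim("/World", "Xform")
stage.SetDefaultPrim(world)

# One prim per scanned object; in practice the mesh would be referenced in
# from the raw scan layer rather than authored here.
chair = stage.DefinePrim("/World/chair_01", "Xform")

# Each annotation source keeps its own layer; composing them as sublayers
# lets tools write independently while USD resolves the merged result.
source_layers = [Sdf.Layer.CreateAnonymous(f"{name}.usda")
                 for name in ("semantic_seg", "affordances")]  # hypothetical sources
for layer in source_layers:
    stage.GetRootLayer().subLayerPaths.append(layer.identifier)

# One way to model an application-specific "flavor": a USD variant set whose
# selection exposes only the attributes a given downstream consumer needs.
flavors = chair.GetVariantSets().AddVariantSet("task_flavor")
for flavor in ("llm_editing", "robot_sim"):
    flavors.AddVariant(flavor)

flavors.SetVariantSelection("robot_sim")
with flavors.GetVariantEditContext():
    # Attributes authored inside the variant are visible only when it is selected.
    chair.CreateAttribute("semantics:label", Sdf.ValueTypeNames.String).Set("chair")
    chair.CreateAttribute("physics:mass", Sdf.ValueTypeNames.Float).Set(5.0)

print(stage.GetRootLayer().ExportToString())
```

Variant sets are only one possible mechanism here; the paper's application-specific flavors could equally be realized through separate layers, custom schemas, or purposes, depending on the downstream consumer.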