🤖 AI Summary
This work addresses the challenge of bridging large language models (LLMs) with the physical world so that robots can understand and autonomously execute natural-language instructions. To this end, we propose a novel paradigm integrating implicit neural scene modeling with 3D language grounding: (1) we construct compact, geometrically consistent implicit scene representations via self-supervised camera calibration, high-fidelity depth field estimation, and large-scale scene reconstruction; (2) we design a spatially aware LLM inference mechanism, coupled with a navigation-oriented 3D language–action alignment benchmark and a closed-loop state-feedback framework. Experiments demonstrate substantial improvements in spatial comprehension accuracy and cross-scale navigation robustness under complex, long-horizon instructions. Our approach provides a scalable technical pathway toward embodied intelligence for generative AI.
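As a concrete illustration of the closed-loop state-feedback idea, the minimal Python sketch below replans after every step from a fresh state summary instead of committing to a fixed open-loop plan. Everything here is hypothetical scaffolding: `plan_next_action` is a stub standing in for an LLM call, and `NavState` and the action tokens are invented for illustration, not the thesis implementation.

```python
from dataclasses import dataclass

@dataclass
class NavState:
    """Hypothetical summary of the robot's situation fed back to the planner."""
    position: tuple      # (x, y) in meters, assumed map frame
    last_action: str     # previous action token
    progress_note: str   # short textual status, e.g. "door ahead is closed"

def plan_next_action(instruction: str, state: NavState) -> str:
    """Stand-in for an LLM call: maps the instruction plus the current
    state summary to the next action token. A real system would prompt
    an LLM with both and parse its reply."""
    if "closed" in state.progress_note:
        return "open_door"
    return "move_forward"

def execute(action: str, state: NavState) -> NavState:
    """Toy environment step; a real robot would return sensed state."""
    x, y = state.position
    if action == "move_forward":
        return NavState((x + 1.0, y), action, "hallway clear")
    return NavState((x, y), action, "door opened")

# Closed loop: the planner re-reads fresh state at every step.
state = NavState((0.0, 0.0), "none", "door ahead is closed")
for _ in range(3):
    action = plan_next_action("go to the kitchen", state)
    state = execute(action, state)
    print(action, state.position, state.progress_note)
```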
📝 Abstract
This thesis introduces "Embodied Spatial Intelligence" to address the challenge of creating robots that can perceive and act in the real world based on natural language instructions. To bridge the gap between Large Language Models (LLMs) and physical embodiment, we present contributions on two fronts: scene representation and spatial reasoning. For perception, we develop robust, scalable, and accurate scene representations using implicit neural models, with contributions in self-supervised camera calibration, high-fidelity depth field generation, and large-scale reconstruction. For spatial reasoning, we enhance the spatial capabilities of LLMs by introducing a novel navigation benchmark, a method for grounding language in 3D, and a state-feedback mechanism to improve long-horizon decision-making. This work lays a foundation for robots that can robustly perceive their surroundings and intelligently act upon complex, language-based commands.
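To make "implicit neural scene representation" and "depth field" more concrete, the sketch below estimates depth along one camera ray with standard volume-rendering quadrature, where sample weights are w_i = T_i (1 − exp(−σ_i Δt)). The `toy_density` field and all constants are illustrative assumptions standing in for a learned model, not the thesis's method.

```python
import numpy as np

def toy_density(points: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a learned implicit field: returns a
    density per 3D point. Here, a soft 'wall' centered at z = 2.0 m."""
    return 5.0 * np.exp(-((points[:, 2] - 2.0) ** 2) / 0.05)

def render_depth(origin, direction, near=0.1, far=4.0, n_samples=128):
    """Expected depth along one ray via volume-rendering quadrature."""
    t = np.linspace(near, far, n_samples)
    dt = t[1] - t[0]
    pts = origin[None, :] + t[:, None] * direction[None, :]
    sigma = toy_density(pts)
    alpha = 1.0 - np.exp(-sigma * dt)            # per-sample opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha)))[:-1]  # T_i
    weights = trans * alpha
    # Weight-normalized expected hit distance along the ray.
    return float(np.sum(weights * t) / max(np.sum(weights), 1e-8))

ray_o = np.array([0.0, 0.0, 0.0])
ray_d = np.array([0.0, 0.0, 1.0])   # unit-length ray looking down +z
print(f"estimated depth: {render_depth(ray_o, ray_d):.2f} m")  # near z = 2
```

The same weighted-quadrature machinery underlies photometric self-supervision in NeRF-style pipelines, which is why a single implicit field can serve calibration, depth estimation, and reconstruction at once.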