🤖 AI Summary
This work addresses the challenge of long-range, semantically complex object navigation for embodied agents in real-world environments by proposing a three-level decoupled architecture. At the high level, a vision-language model provides semantically guided navigation over a structured scene representation; at the mid level, a hierarchical room-based navigation strategy reserves VLM reasoning for room-level decisions; and at the low level, embodiment-specific motion control modules execute the planned waypoints. The system unifies semantic understanding and embodied control through this multi-level coordination and, for the first time, achieves efficient and robust cross-platform object navigation at building scale in real environments. Extensive real-world experiments, 190 trials across three robotic platforms, demonstrate substantial improvements in success rate and navigation efficiency, and the approach also attains state-of-the-art performance on four simulation benchmarks.
📝 Abstract
Object navigation (ObjectNav) in real-world environments is a complex problem that requires simultaneously addressing multiple challenges, including complex spatial structures, long-horizon planning, and semantic understanding. Recent advances in Vision-Language Models (VLMs) offer promising capabilities for semantic understanding, yet effectively integrating them into real-world navigation systems remains a non-trivial challenge. In this work, we formulate real-world ObjectNav as a system-level problem and introduce SysNav, a three-level ObjectNav system designed for real-world cross-embodiment deployment. SysNav decouples semantic reasoning, navigation planning, and motion control to ensure robustness and generalizability. At the high level, we summarize the environment into a structured scene representation and leverage VLMs to provide semantically grounded navigation guidance. At the mid level, we introduce a hierarchical room-based navigation strategy that reserves VLM guidance for room-level decisions, which effectively utilizes the model's reasoning ability while ensuring system efficiency. At the low level, planned waypoints are executed through embodiment-specific motion control modules. We deploy our system on three embodiments: a custom-built wheeled robot, the Unitree Go2 quadruped, and the Unitree G1 humanoid, and conduct 190 real-world experiments. Our system achieves substantial improvements in both success rate and navigation efficiency. To the best of our knowledge, SysNav is the first system capable of reliably and efficiently completing building-scale long-range object navigation in complex real-world environments. Furthermore, extensive experiments on four simulation benchmarks demonstrate state-of-the-art performance. Project page is available at: https://cmu-vln.github.io/.
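The three-level decoupling described in the abstract can be sketched in miniature. The following is a minimal, illustrative Python sketch, not the paper's implementation: all class and function names (`SceneGraph`, `vlm_room_choice`, `plan_waypoints`, `execute`) are hypothetical, and the VLM query is stubbed with a trivial keyword match to show where semantic reasoning plugs into room-level decisions.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """High level: structured scene representation mapping rooms to observed objects."""
    rooms: dict = field(default_factory=dict)

    def add_observation(self, room: str, objects: list):
        self.rooms.setdefault(room, []).extend(objects)

def vlm_room_choice(scene: SceneGraph, goal: str):
    """Stand-in for the VLM query: pick the room most likely to contain the goal.
    The real system would prompt a vision-language model with the scene summary."""
    for room, objs in scene.rooms.items():
        if goal in objs:
            return room
    # Fall back to any known room if the goal has not been observed yet.
    return next(iter(scene.rooms), None)

def plan_waypoints(current_room: str, target_room: str):
    """Mid level: room-based plan, here simply a list of room-level waypoints."""
    return [] if current_room == target_room else [target_room]

def execute(waypoints: list, embodiment: str = "wheeled"):
    """Low level: embodiment-specific motion control, stubbed as command strings."""
    return [f"{embodiment}: move to {wp}" for wp in waypoints]

if __name__ == "__main__":
    scene = SceneGraph()
    scene.add_observation("kitchen", ["sink", "mug"])
    scene.add_observation("office", ["desk"])
    target = vlm_room_choice(scene, "mug")          # high-level decision
    plan = plan_waypoints("office", target)         # mid-level plan
    print(execute(plan, embodiment="quadruped"))    # low-level execution
```

Because the VLM is consulted only at the room level, the (expensive) semantic-reasoning call runs far less often than the low-level controller, which is the efficiency argument the abstract makes for the hierarchical design.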