SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation

📅 2026-03-06
🤖 AI Summary
This work addresses the challenge of long-range, semantically complex object navigation across embodied agents in real-world environments by proposing a three-tier decoupled architecture. At the high level, a vision-language model generates semantic-guided scene representations; at the mid level, a room-level hierarchical navigation strategy is employed; and at the low level, an embodied adaptive motion control module is integrated. This system achieves, for the first time, efficient and robust cross-platform object navigation at building scale in real environments, effectively unifying semantic understanding and embodied control through a multi-level coordination mechanism. Extensive real-world experiments—190 trials across three robotic platforms—demonstrate significant improvements in success rate and efficiency, while the approach also attains state-of-the-art performance on four simulation benchmarks.

📝 Abstract
Object navigation (ObjectNav) in real-world environments is a complex problem that requires simultaneously addressing multiple challenges, including complex spatial structure, long-horizon planning, and semantic understanding. Recent advances in Vision-Language Models (VLMs) offer promising capabilities for semantic understanding, yet effectively integrating them into real-world navigation systems remains a non-trivial challenge. In this work, we formulate real-world ObjectNav as a system-level problem and introduce SysNav, a three-level ObjectNav system designed for real-world cross-embodiment deployment. SysNav decouples semantic reasoning, navigation planning, and motion control to ensure robustness and generalizability. At the high level, we summarize the environment into a structured scene representation and leverage VLMs to provide semantic-grounded navigation guidance. At the mid level, we introduce a hierarchical room-based navigation strategy that reserves VLM guidance for room-level decisions, which effectively utilizes its reasoning ability while ensuring system efficiency. At the low level, planned waypoints are executed through different embodiment-specific motion control modules. We deploy our system on three embodiments, a custom-built wheeled robot, the Unitree Go2 quadruped and the Unitree G1 humanoid, and conduct 190 real-world experiments. Our system achieves substantial improvements in both success rate and navigation efficiency. To the best of our knowledge, SysNav is the first system capable of reliably and efficiently completing building-scale long-range object navigation in complex real-world environments. Furthermore, extensive experiments on four simulation benchmarks demonstrate state-of-the-art performance. Project page is available at: https://cmu-vln.github.io/.
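The three-level decoupling described in the abstract can be sketched as a minimal pipeline. Everything below is an illustrative assumption, not SysNav's actual implementation: the class and function names (`SceneRepresentation`, `high_level_guidance`, `mid_level_plan`, `low_level_execute`) are hypothetical, the VLM is stubbed with a keyword match, and room-level planning is shown as BFS over a room-adjacency graph.

```python
# Hypothetical sketch of a three-level decoupled ObjectNav pipeline.
# None of these interfaces come from the paper; they only illustrate the
# high/mid/low separation the abstract describes.
from dataclasses import dataclass, field


@dataclass
class SceneRepresentation:
    """High-level structured summary of the environment: rooms and the
    objects observed in each (stand-in for the VLM-facing scene summary)."""
    rooms: dict[str, list[str]] = field(default_factory=dict)


def high_level_guidance(scene: SceneRepresentation, target: str) -> str:
    """Stand-in for VLM semantic reasoning: pick the room most likely to
    contain the target object (here, a trivial keyword match)."""
    for room, objects in scene.rooms.items():
        if target in objects:
            return room
    return next(iter(scene.rooms))  # fall back to exploring the first room


def mid_level_plan(current_room: str, goal_room: str,
                   adjacency: dict[str, list[str]]) -> list[str]:
    """Room-level hierarchical planning: BFS over a room-adjacency graph,
    so the expensive high-level model is consulted only per room decision."""
    frontier, visited = [[current_room]], {current_room}
    while frontier:
        path = frontier.pop(0)
        if path[-1] == goal_room:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return []


def low_level_execute(room_path: list[str], embodiment: str) -> list[str]:
    """Embodiment-specific motion control stub: each platform turns the same
    room waypoints into its own motion commands."""
    gait = {"wheeled": "drive", "quadruped": "trot", "humanoid": "walk"}[embodiment]
    return [f"{gait} to {room}" for room in room_path[1:]]


if __name__ == "__main__":
    scene = SceneRepresentation(rooms={
        "hallway": ["door"], "kitchen": ["sink", "fridge"], "office": ["chair"],
    })
    adjacency = {"hallway": ["kitchen", "office"], "kitchen": ["hallway"],
                 "office": ["hallway"]}
    goal_room = high_level_guidance(scene, "fridge")        # -> "kitchen"
    path = mid_level_plan("hallway", goal_room, adjacency)  # -> ["hallway", "kitchen"]
    print(low_level_execute(path, "quadruped"))             # -> ["trot to kitchen"]
```

The point of the sketch is the interface boundary: the same high- and mid-level outputs (room waypoints) are consumed unchanged by wheeled, quadruped, and humanoid controllers, which is what makes the system cross-embodiment.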
Problem

Research questions and friction points this paper is trying to address.

Object Navigation
Cross-Embodiment
Real-World Navigation
Long-Horizon Planning
Semantic Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic Cooperation
Cross-Embodiment Navigation
Vision-Language Models
Hierarchical Navigation
Real-World ObjectNav
Haokun Zhu
Carnegie Mellon University
Computer Vision, Embodied AI
Zongtai Li
Carnegie Mellon University
Zihan Liu
Carnegie Mellon University, New York University
Kevin Guo
Carnegie Mellon University
Zhengzhi Lin
Carnegie Mellon University
Yuxin Cai
Carnegie Mellon University, Nanyang Technological University
Guofei Chen
Carnegie Mellon University
Chen Lv
Nanyang Technological University
Wenshan Wang
Carnegie Mellon University
Robotics, Machine Learning, Artificial Intelligence
Jean Oh
Robotics Institute, Carnegie Mellon University
Robotics, Multimodal Perception, Social Navigation, Language-Vision Intersection, Artificial Intelligence
Ji Zhang
Carnegie Mellon University
SLAM, Navigation, Planning, Exploration, Scene Understanding