🤖 AI Summary
This work addresses the challenge in existing multimodal large language model (MLLM)-based navigation systems, where tight coupling between semantic understanding and spatial perception hinders simultaneous high-level instruction comprehension and precise spatial reasoning. To overcome this limitation, the authors propose a decoupled navigation architecture that separates low-level metric spatial state estimation from high-level semantic planning. The approach introduces an explicit, interactive metric world representation to replace simplified textual maps and integrates counterfactual reasoning to enhance the MLLM’s ability to perform physically consistent reasoning. Evaluated on R2R-CE and RxR-CE benchmarks, the method achieves success rates of 48.8% and 42.2%, respectively, and demonstrates, for the first time, zero-shot sim-to-real transfer across both wheeled robots and drones, significantly improving generalization and deployment flexibility.
📝 Abstract
A navigation agent must understand high-level semantic instructions while maintaining precise spatial perception. Building navigation agents centered on Multimodal Large Language Models (MLLMs) is a promising direction thanks to their powerful generalization ability. However, the prevailing tightly coupled design severely limits system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich, consistent information, allowing MLLMs to interact with and reason over it for decision-making. Furthermore, counterfactual reasoning is introduced to further elicit the MLLMs' reasoning capacity, while the metric world representation ensures the physical validity of the produced actions. We conduct comprehensive experiments in both simulated and real-world environments. Our method establishes a new zero-shot state of the art, achieving a 48.8% Success Rate (SR) on R2R-CE and 42.2% on RxR-CE. Furthermore, to validate the versatility of our metric representation, we demonstrate zero-shot sim-to-real transfer across diverse embodiments, including a wheeled TurtleBot 4 and a custom-built aerial drone. These real-world deployments verify that our decoupled framework serves as a robust, domain-invariant interface for embodied Vision-and-Language Navigation.
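The decoupled interface the abstract describes can be pictured as a minimal loop: a low-level metric map maintains spatial state, a high-level planner proposes actions, and the map vetoes physically infeasible ones. The sketch below is purely illustrative (all class and function names are hypothetical, and a trivial filter stands in for the MLLM's semantic and counterfactual reasoning); it is not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MetricMap:
    """Low-level metric world representation: occupancy over a 2D grid.
    (Hypothetical stand-in for the paper's interactive metric representation.)"""
    occupied: set = field(default_factory=set)  # blocked (x, y) cells
    agent_pos: tuple = (0, 0)

    def update(self, obstacles):
        """Fuse newly observed obstacle cells into the map."""
        self.occupied |= set(obstacles)

    def is_valid(self, waypoint):
        """Physical-validity check: a waypoint must be a free cell."""
        return waypoint not in self.occupied

def plan_step(metric_map, candidate_waypoints):
    """High-level planner stub: return the first physically valid candidate.

    In the paper's framework an MLLM reasons (including counterfactually)
    over the metric map to rank candidates; here a simple validity filter
    stands in to show the decoupled interface, not the actual method.
    """
    for wp in candidate_waypoints:
        if metric_map.is_valid(wp):  # the map vetoes infeasible actions
            return wp
    return metric_map.agent_pos  # stay put if nothing is feasible

world = MetricMap()
world.update([(1, 0)])                     # observed obstacle at (1, 0)
step = plan_step(world, [(1, 0), (0, 1)])  # first candidate is blocked
print(step)                                # -> (0, 1)
```

The key design point this illustrates is the division of labor: the planner never needs metric consistency internally, because every action it emits passes through the metric map's validity check before execution.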