One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in existing multimodal large language model (MLLM)-based navigation systems, where tight coupling between semantic understanding and spatial perception hinders simultaneous high-level instruction comprehension and precise spatial reasoning. To overcome this limitation, the authors propose a decoupled navigation architecture that separates low-level metric spatial state estimation from high-level semantic planning. The approach introduces an explicit, interactive metric world representation to replace simplified textual maps and integrates counterfactual reasoning to enhance the MLLM’s ability to perform physically consistent reasoning. Evaluated on R2R-CE and RxR-CE benchmarks, the method achieves success rates of 48.8% and 42.2%, respectively, and demonstrates, for the first time, zero-shot sim-to-real transfer across both wheeled robots and drones, significantly improving generalization and deployment flexibility.

📝 Abstract
A navigation agent needs to understand both high-level semantic instructions and precise spatial perception. Building navigation agents centered on Multimodal Large Language Models (MLLMs) is a promising direction due to their powerful generalization ability, but the tightly coupled designs used so far dramatically limit system performance. In this work, we propose a decoupled design that separates low-level spatial state estimation from high-level semantic planning. Unlike previous methods that rely on predefined, oversimplified textual maps, we introduce an interactive metric world representation that maintains rich and consistent information, allowing MLLMs to interact with and reason over it for decision-making. Furthermore, counterfactual reasoning is introduced to further elicit the MLLMs' capabilities, while the metric world representation ensures the physical validity of the produced actions. We conduct comprehensive experiments in both simulated and real-world environments. Our method establishes a new zero-shot state of the art, achieving a 48.8% Success Rate (SR) on the R2R-CE benchmark and 42.2% on RxR-CE. Furthermore, to validate the versatility of our metric representation, we demonstrate zero-shot sim-to-real transfer across diverse embodiments, including a wheeled TurtleBot 4 and a custom-built aerial drone. These real-world deployments verify that our decoupled framework serves as a robust, domain-invariant interface for embodied Vision-and-Language Navigation.
Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation
Multimodal Large Language Models
World Representation
Embodied AI
Zero-shot Transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

decoupled design
metric world representation
multimodal large language models
counterfactual reasoning
zero-shot sim-to-real transfer
Zerui Li
Adelaide University
Robotics · Computer Vision · Embodied AI
Hongpei Zheng
The University of Manchester
Fangguo Zhao
Zhejiang University
Aidan Chan
Australian Institute for Machine Learning, Adelaide University
Jian Zhou
Australian Institute for Machine Learning, Adelaide University
Sihao Lin
Postdoc, AIML, The University of Adelaide
Artificial intelligence · Pattern recognition · Vision-language model
Shijie Li
I2R, A*STAR
Computer Vision · 3D Reconstruction · SLAM
Qi Wu
Associate Professor, University of Adelaide, Adelaide, Australia
Computer Vision · Machine Learning · NLP