Boosting Zero-Shot VLN via Abstract Obstacle Map-Based Waypoint Prediction with TopoGraph-and-VisitInfo-Aware Prompting

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses zero-shot vision-language navigation (VLN) in continuous environments, proposing a training-free framework. To ground language instructions in unstructured 3D spaces without task-specific supervision, the method introduces three key components: (1) a waypoint predictor that operates on an abstract obstacle map and produces linearly reachable waypoints; (2) a dynamically updated topological graph with explicit visitation records; and (3) prompting that encodes the graph and visitation information so that a multimodal large language model (MLLM) can reason over both spatial structure and exploration history. Together, these components encourage exploration and give the MLLM local path planning for error correction. Evaluated on the R2R-CE and RxR-CE continuous-environment benchmarks, the approach reports state-of-the-art zero-shot success rates of 41% and 36%, respectively.

📝 Abstract
With the rapid progress of foundation models and robotics, vision-language navigation (VLN) has emerged as a key task for embodied agents with broad practical applications. We address VLN in continuous environments, a particularly challenging setting where an agent must jointly interpret natural language instructions, perceive its surroundings, and plan low-level actions. We propose a zero-shot framework that integrates a simplified yet effective waypoint predictor with a multimodal large language model (MLLM). The predictor operates on an abstract obstacle map, producing linearly reachable waypoints, which are incorporated into a dynamically updated topological graph with explicit visitation records. The graph and visitation information are encoded into the prompt, enabling reasoning over both spatial structure and exploration history to encourage exploration and equip the MLLM with local path planning for error correction. Extensive experiments on R2R-CE and RxR-CE show that our method achieves state-of-the-art zero-shot performance, with success rates of 41% and 36%, respectively.
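One way to picture "linearly reachable" waypoints is as cells that the agent can reach along a straight, obstacle-free line on a 2D grid. The sketch below is only an illustration under that assumption, not the paper's predictor: the occupancy-grid representation, the Bresenham line test, and the names `line_is_free` and `candidate_waypoints` are all hypothetical choices made for this example.

```python
from math import cos, sin, pi


def line_is_free(grid, start, end):
    """Return True if every cell on the Bresenham line from start to end is free (0)."""
    (r0, c0), (r1, c1) = start, end
    dr, dc = abs(r1 - r0), abs(c1 - c0)
    sr = 1 if r1 >= r0 else -1
    sc = 1 if c1 >= c0 else -1
    err = dr - dc
    r, c = r0, c0
    while True:
        # Leaving the map or hitting an occupied cell blocks the line.
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c]:
            return False
        if (r, c) == (r1, c1):
            return True
        e2 = 2 * err
        if e2 > -dc:
            err -= dc
            r += sr
        if e2 < dr:
            err += dr
            c += sc


def candidate_waypoints(grid, agent, radius=3, n_angles=8):
    """Sample cells on a ring around the agent and keep those reachable in a straight line."""
    waypoints = []
    for k in range(n_angles):
        theta = 2 * pi * k / n_angles
        target = (agent[0] + round(radius * sin(theta)),
                  agent[1] + round(radius * cos(theta)))
        if target != agent and target not in waypoints and line_is_free(grid, agent, target):
            waypoints.append(target)
    return waypoints


# Hypothetical 5x5 map: 0 = free, 1 = obstacle, with one obstacle right of the agent.
grid = [[0] * 5 for _ in range(5)]
grid[2][3] = 1
waypoints = candidate_waypoints(grid, agent=(2, 2), radius=2, n_angles=4)
```

In this toy map the candidate directly behind the obstacle is rejected, while the other three ring cells are kept as waypoints.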
Problem

Research questions and friction points this paper is trying to address.

Addressing vision-language navigation in continuous environments for embodied agents
Integrating natural language interpretation with environmental perception and action planning
Enhancing zero-shot navigation performance through abstract map-based waypoint prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Abstract obstacle map-based waypoint prediction
Topological graph with explicit visitation records
Spatial and history-aware prompting for MLLM
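The idea of a topological graph with visitation records that is encoded into the prompt can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `TopoGraph` class, its methods, and the text layout of the prompt are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class TopoGraph:
    """Hypothetical waypoint graph that tracks how often each node was visited."""
    edges: dict = field(default_factory=dict)   # node id -> set of neighbor ids
    visits: dict = field(default_factory=dict)  # node id -> visit count

    def add_edge(self, a, b):
        # Register both endpoints, then connect them in both directions.
        for node in (a, b):
            self.edges.setdefault(node, set())
            self.visits.setdefault(node, 0)
        self.edges[a].add(b)
        self.edges[b].add(a)

    def mark_visited(self, node):
        self.edges.setdefault(node, set())
        self.visits[node] = self.visits.get(node, 0) + 1

    def to_prompt(self, current):
        """Serialize connectivity and visit status as plain text for an MLLM prompt."""
        lines = [f"Current node: {current}"]
        for node in sorted(self.edges):
            count = self.visits[node]
            status = f"visited x{count}" if count else "unvisited"
            neighbors = ", ".join(str(n) for n in sorted(self.edges[node]))
            lines.append(f"Node {node} ({status}) connects to: {neighbors}")
        return "\n".join(lines)


graph = TopoGraph()
graph.add_edge(0, 1)
graph.add_edge(1, 2)
graph.mark_visited(0)
graph.mark_visited(1)
prompt = graph.to_prompt(current=1)
```

Listing unvisited nodes explicitly is one plausible way to let the model prefer unexplored waypoints or backtrack along known edges after a wrong turn.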

Authors

Boqi Li
Postdoc Research Fellow, Civil and Environmental Engineering, University of Michigan
connected mobility systems

Siyuan Li
Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA

Weiyi Wang
Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA

Anran Li
Yale University
Trustworthy AI, medical LLMs, federated learning

Zhong Cao
University of Michigan
Autonomous Vehicle, Reinforcement Learning

Henry X. Liu
Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, MI, USA