Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN

📅 2026-05-14

🤖 AI Summary

This work addresses a fundamental tension in navigation systems that pair a vision-language model (VLM) with a large language model (LLM): the pursuit of pixel-level 3D perception accuracy conflicts with the computational limits and real-time efficiency required for embodied navigation. Within a typical VLM-LLM framework, we report a perception saturation phenomenon: beyond a certain threshold, further gains in 3D perception accuracy yield diminishing returns in navigation success. To quantify this effect, we propose statistical upper bounds on success rate (SR) for two subsystems: the slow LLM planner, which relies on topological mapping semantics, and the fast reactive navigator, which uses spatial coordinates and bounding boxes to execute the LLM's decisions. Evaluations with state-of-the-art 3D scene understanding models support these bounds and suggest that 3D scene understanding for VLN should prioritize navigation-relevant core vocabularies and accurate bounding box proportions over strict pixel-level precision.

📝 Abstract

Zero-shot vision-and-language navigation (VLN) has gained significant attention due to its minimal data collection costs and inherent generalization. This paradigm is typically driven by the integration of pre-trained Vision-Language Models (VLMs) and Large Language Models (LLMs), where VLMs construct 3D scene graphs while LLMs handle high-level reasoning and decision-making. However, a critical bottleneck exists in this system: current 3D perception models prioritize pixel-level accuracy, directly conflicting with the strict computational limits and real-time efficiency demanded by embodied navigation. To address this gap, this paper quantifies the actual impact of 3D scene understanding capability on VLN performance. Based on typical VLM-LLM frameworks, we propose statistical success rate (SR) upper bounds for two core subsystems: 1) the slow LLM planner, which relies on topological mapping semantics, and 2) the fast reactive navigator, which utilizes spatial coordinates and bounding boxes to execute LLM decisions. Evaluations using state-of-the-art 3D scene understanding models validate our proposed bounds and reveal a perception saturation phenomenon, indicating that improvements in perception accuracy beyond a certain threshold yield diminishing returns in navigation success. Our findings suggest that 3D scene understanding for VLN should pivot away from strict pixel-level precision, prioritizing instead navigation-relevant core vocabularies and accurate bounding box proportions.
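To make the saturation idea concrete, here is a minimal sketch of how a threshold of diminishing returns could be read off a set of measurements. It is not the paper's method: the function name `saturation_threshold`, the `min_gain` tolerance, and the slope-based criterion are all assumptions made for illustration, and the numbers in the usage line are invented placeholders, not results from the paper.

```python
from typing import Optional, Sequence, Tuple


def saturation_threshold(
    points: Sequence[Tuple[float, float]],
    min_gain: float = 0.1,
) -> Optional[float]:
    """Return the lowest perception accuracy after which every further
    measured step improves SR by less than `min_gain` SR points per
    accuracy point. Returns None if SR is still improving at the end.

    `points` holds hypothetical (perception_accuracy, success_rate) pairs.
    """
    pts = sorted(points)
    if len(pts) < 2:
        return None
    # Require distinct accuracies so every slope is well defined.
    for (acc0, _), (acc1, _) in zip(pts, pts[1:]):
        if acc1 == acc0:
            raise ValueError("perception accuracies must be distinct")
    # Slope of SR with respect to accuracy between consecutive measurements.
    slopes = [
        (sr1 - sr0) / (acc1 - acc0)
        for (acc0, sr0), (acc1, sr1) in zip(pts, pts[1:])
    ]
    # Walk backward from the most accurate model while gains stay small.
    threshold = None
    for i in range(len(slopes) - 1, -1, -1):
        if slopes[i] >= min_gain:
            break
        threshold = pts[i][0]
    return threshold


# Illustrative placeholder values only (accuracy %, SR %).
demo = [(20, 10), (40, 30), (60, 42), (80, 43), (90, 43.5)]
print(saturation_threshold(demo))
```

With these placeholder values, SR rises steeply up to an accuracy of 60 and is nearly flat afterward, so the function reports 60 as the point past which extra perception accuracy buys little navigation success.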

Problem

Research questions and friction points this paper is trying to address.

Vision-and-Language Navigation

3D Scene Understanding

Zero-Shot VLN

Perception Bottleneck

Embodied Navigation

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot VLN

3D scene understanding

perception saturation