🤖 AI Summary
This work addresses a fundamental tension in navigation systems that pair vision-language models (VLMs) with large language models (LLMs): the pursuit of pixel-level 3D perception accuracy conflicts with the real-time efficiency that embodied navigation requires. We reveal, for the first time within a standard VLM-LLM framework, a performance saturation phenomenon: beyond a certain threshold, further gains in 3D perception accuracy yield diminishing returns in navigation success. To quantify this effect, we establish statistical success rate (SR) upper bounds for the two core subsystems: the high-level LLM planner, which relies on topological mapping semantics, and the low-level reactive navigator, which uses spatial coordinates and bounding boxes to execute the planner's decisions. Experimental results show that navigation performance does not scale linearly with perception fidelity, pointing toward more efficient VLM-LLM navigation system designs.
📝 Abstract
Zero-shot vision-and-language navigation (VLN) has gained significant attention due to its minimal data collection costs and inherent generalization. This paradigm is typically driven by the integration of pre-trained Vision-Language Models (VLMs) and Large Language Models (LLMs), where VLMs construct 3D scene graphs while LLMs handle high-level reasoning and decision-making. However, a critical bottleneck exists in this system: current 3D perception models prioritize pixel-level accuracy, directly conflicting with the strict computational limits and real-time efficiency demanded by embodied navigation. To address this gap, this paper quantifies the actual impact of 3D scene understanding capability on VLN performance. Based on typical VLM-LLM frameworks, we propose statistical success rate (SR) upper bounds for two core subsystems: 1) the slow LLM planner, which relies on topological mapping semantics, and 2) the fast reactive navigator, which utilizes spatial coordinates and bounding boxes to execute LLM decisions. Evaluations using state-of-the-art 3D scene understanding models validate our proposed bounds and reveal a perception saturation phenomenon, indicating that improvements in perception accuracy beyond a certain threshold yield diminishing returns in navigation success. Our findings suggest that 3D scene understanding for VLN should pivot away from strict pixel-level precision, prioritizing instead navigation-relevant core vocabularies and accurate bounding box proportions.
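The saturation effect described above can be illustrated with a toy bottleneck model. This is a minimal sketch and not the paper's derivation: it assumes only that an episode succeeds when the planner decides correctly and the navigator executes correctly, in which case the overall SR cannot exceed the smaller of the two subsystem success rates. The function name `sr_upper_bound` and all numeric values are hypothetical and chosen purely for illustration.

```python
def sr_upper_bound(planner_sr: float, navigator_sr: float) -> float:
    """Toy upper bound on end-to-end success rate (SR).

    Assumption (not from the paper): an episode succeeds only if BOTH the
    planner and the navigator succeed. Then P(both) <= min(P(plan), P(exec)),
    regardless of how the two events are correlated.
    """
    for name, value in (("planner_sr", planner_sr), ("navigator_sr", navigator_sr)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return min(planner_sr, navigator_sr)


# Hypothetical planner-side success rates, e.g. from increasingly accurate perception.
planner_levels = [0.5, 0.7, 0.9, 0.99]
# Hypothetical fixed success rate of the reactive navigator.
navigator_sr = 0.7

bounds = [sr_upper_bound(p, navigator_sr) for p in planner_levels]
print(bounds)  # [0.5, 0.7, 0.7, 0.7]
```

In this toy setting, once the planner-side rate passes the navigator's 0.7, better perception no longer raises the bound. This is only one simple way a saturation threshold can arise; the bounds proposed in the paper may take a different form.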