🤖 AI Summary
This work addresses the high latency and computational overhead of existing zero-shot vision-and-language navigation (VLN) methods, which rely on large language model (LLM) inference at every step, hindering real-time deployment. Inspired by dual-process cognitive theory, we propose an efficient zero-shot VLN framework comprising a slow LLM-based planner that generates subgoal chains and a fast navigator that executes these subgoals. A lightweight asynchronous bridging mechanism aligns imagined and perceived object graphs, enabling confidence-driven, on-demand LLM invocation. To our knowledge, this is the first zero-shot VLN system in which slow and fast components collaborate through LLM calls triggered by internal confidence. Our method matches or exceeds state-of-the-art zero-shot success rates on the R2R and REVERIE benchmarks while reducing token consumption per trajectory by over 50% and running more than 3.5× faster, and it is demonstrated in the real world on a legged robot in a hotel suite.
📝 Abstract
Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to vision-and-language navigation (VLN), where an agent follows natural language instructions using only egocentric perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns the structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaborative zero-shot VLN system that supports asynchronous LLM triggering based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
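The confidence-triggered collaboration described in the abstract can be illustrated with a small sketch. This is not the SFCo-Nav implementation: the `ObjectGraph` class, the `alignment_confidence` score (mean attribute overlap over the objects the planner expects), the `run_episode` loop, the `replan` callback, and the 0.5 threshold are all hypothetical choices made for illustration, and the loop is synchronous for simplicity, whereas the paper describes an asynchronous bridge.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectGraph:
    """Toy attributed object graph (assumed form): object label -> set of attributes."""
    nodes: dict = field(default_factory=dict)


def alignment_confidence(imagined: ObjectGraph, perceived: ObjectGraph) -> float:
    """Hypothetical confidence: average, over imagined objects, of the fraction
    of their expected attributes that also appear on the perceived object."""
    if not imagined.nodes:
        return 0.0
    score = 0.0
    for label, attrs in imagined.nodes.items():
        seen = perceived.nodes.get(label)
        if seen is None:
            continue  # expected object was not observed; contributes 0
        score += len(attrs & seen) / len(attrs) if attrs else 1.0
    return score / len(imagined.nodes)


def run_episode(observations, plan, replan, threshold=0.5):
    """Fast loop over perceived graphs; the slow planner (`replan`, a stand-in
    for an LLM call) runs only when confidence drops below `threshold`.
    Returns the number of slow-planner calls."""
    slow_calls = 0
    for perceived in observations:
        if alignment_confidence(plan, perceived) < threshold:
            plan = replan(perceived)  # expensive call, made on demand
            slow_calls += 1
        # otherwise the fast navigator keeps executing the current subgoal
    return slow_calls


# Usage with a stub planner that simply adopts what is currently perceived.
imagined = ObjectGraph({"sofa": {"red"}, "lamp": {"tall"}})
on_track = ObjectGraph({"sofa": {"red", "large"}, "lamp": {"short"}})
off_track = ObjectGraph({"bed": {"white"}})
calls = run_episode(
    [on_track, off_track, off_track],
    imagined,
    replan=lambda p: ObjectGraph(dict(p.nodes)),
)
print(calls)  # 1: only the first off-track observation triggers the planner
```

In this toy run the planner is called once for three steps, whereas a per-step scheme would call it three times. That difference is the kind of saving the on-demand design targets, although the numbers here say nothing about the reported 50% token reduction.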