SFCo-Nav: Efficient Zero-Shot Visual Language Navigation via Collaboration of Slow LLM and Fast Attributed Graph Alignment

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high latency and computational overhead of existing zero-shot vision-and-language navigation (VLN) methods, which rely on large language model (LLM) inference at every step, hindering real-time deployment. Inspired by dual-process cognitive theory, the authors propose an efficient zero-shot VLN framework comprising a slow LLM-based planner that generates subgoal chains and a fast navigator that executes these subgoals. A lightweight asynchronous bridging mechanism aligns imagined and perceptual graphs, enabling confidence-driven, on-demand LLM invocation. To the authors' knowledge, this is the first zero-shot VLN system that integrates internal confidence-triggered collaboration between slow and fast components. The method matches or exceeds state-of-the-art zero-shot performance on the R2R and REVERIE benchmarks while reducing token consumption by over 50% and accelerating inference by 3.5×, with successful real-world validation on a quadruped robot in a hotel environment.

📝 Abstract
Recent advances in large vision-language models (VLMs) and large language models (LLMs) have enabled zero-shot approaches to visual language navigation (VLN), where an agent follows natural language instructions using only egocentric perception and reasoning. However, existing zero-shot methods typically construct a naive observation graph and perform per-step VLM-LLM inference on it, resulting in high latency and computation costs that limit real-time deployment. To address this, we present SFCo-Nav, an efficient zero-shot VLN framework inspired by the principle of slow-fast cognitive collaboration. SFCo-Nav integrates three key modules: 1) a slow LLM-based planner that produces a strategic chain of subgoals, each linked to an imagined object graph; 2) a fast reactive navigator for real-time object graph construction and subgoal execution; and 3) a lightweight asynchronous slow-fast bridge that aligns structured, attributed imagined and perceived graphs to estimate navigation confidence, triggering the slow LLM planner only when necessary. To the best of our knowledge, SFCo-Nav is the first slow-fast collaborative zero-shot VLN system that triggers the LLM asynchronously based on internal confidence. Evaluated on the public R2R and REVERIE benchmarks, SFCo-Nav matches or exceeds prior state-of-the-art zero-shot VLN success rates while cutting total token consumption per trajectory by over 50% and running more than 3.5 times faster. Finally, we demonstrate SFCo-Nav on a legged robot in a hotel suite, showcasing its efficiency and practicality in indoor environments.
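The confidence-driven loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `slow_plan` stands in for the LLM planner, its subgoals and imagined object graphs are hypothetical, and the graph alignment is reduced to a Jaccard overlap of object labels (the paper's attributed graph alignment is richer).

```python
def align_graphs(imagined, perceived):
    """Stand-in confidence signal: Jaccard overlap between the node
    sets of the imagined and perceived object graphs."""
    imagined, perceived = set(imagined), set(perceived)
    union = imagined | perceived
    if not union:
        return 0.0
    return len(imagined & perceived) / len(union)

def slow_plan(instruction):
    """Stub for the slow LLM planner: returns a chain of subgoals,
    each paired with an imagined object graph (hypothetical output)."""
    return [("go to the sofa", {"sofa", "rug"}),
            ("enter the kitchen", {"fridge", "sink"})]

def navigate(instruction, observations, threshold=0.3):
    """Fast reactive loop: execute subgoals step by step and re-invoke
    the slow planner only when alignment confidence drops below the
    threshold (on-demand LLM triggering)."""
    subgoals = slow_plan(instruction)   # initial plan: one LLM call
    llm_calls = 1
    log = []
    for step, perceived in enumerate(observations):
        goal, imagined = subgoals[min(step, len(subgoals) - 1)]
        conf = align_graphs(imagined, perceived)
        if conf < threshold:
            subgoals = slow_plan(instruction)  # confidence-triggered re-plan
            llm_calls += 1
        log.append((goal, round(conf, 2)))
    return log, llm_calls
```

With observations that match the first imagined graph but not the second, only the mismatched step triggers an extra planner call, which is the mechanism behind the reported token savings.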
Problem

Research questions and friction points this paper is trying to address.

visual language navigation
zero-shot
large language models
computational efficiency
real-time deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

slow-fast collaboration
zero-shot visual language navigation
asynchronous LLM triggering
attributed graph alignment
efficient VLN
Chaoran Xiong
Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai Jiao Tong University (SJTU), Shanghai 200240, China
Litao Wei
Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai Jiao Tong University (SJTU), Shanghai 200240, China; Zhiyuan College, SJTU
Xinhao Hu
Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai Jiao Tong University (SJTU), Shanghai 200240, China
Kehui Ma
Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai Jiao Tong University (SJTU), Shanghai 200240, China
Ziyi Xia
University of British Columbia
Computer Graphics, VR, Machine Learning
Zixin Jiang
Shanghai Key Laboratory of Navigation and Location Based Services, Shanghai Jiao Tong University (SJTU), Shanghai 200240, China
Zhen Sun
DSA Thrust, HKUST(GZ)
LLM security
Ling Pei
Shanghai Jiao Tong University
Navigation, Positioning, SLAM, Sensor Fusion, GNSS