IROS: A Dual-Process Architecture for Real-Time VLM-Based Indoor Navigation

📅 2026-01-29
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing real-time performance and semantic understanding in indoor semantic navigation. Geometric approaches lack semantic awareness, while continuously invoking vision-language models (VLMs) incurs prohibitive latency. To overcome this, we propose IROS, a novel framework that applies dual-process theory from cognitive science to real-time VLM-based navigation. IROS decouples fast, reflexive decision-making (System 1) from on-demand, lightweight VLM reasoning (System 2), integrating spatial and textual cues for efficient semantic navigation. Experiments across five real-world building environments demonstrate that IROS reduces latency by 66% on low-power devices compared to continuous VLM invocation, while also improving decision accuracy.

📝 Abstract
Indoor mobile robot navigation requires fast responsiveness and robust semantic understanding, yet existing methods struggle to provide both. Classical geometric approaches such as SLAM offer reliable localization but depend on detailed maps and cannot interpret human-targeted cues (e.g., signs, room numbers) essential for indoor reasoning. Vision-Language-Action (VLA) models introduce semantic grounding but remain strictly reactive, basing decisions only on visible frames and failing to anticipate unseen intersections or reason about distant textual cues. Vision-Language Models (VLMs) provide richer contextual inference but suffer from high computational latency, making them unsuitable for real-time operation on embedded platforms. In this work, we present IROS, a real-time navigation framework that combines VLM-level contextual reasoning with the efficiency of lightweight perceptual modules on low-cost, on-device hardware. Inspired by Dual Process Theory, IROS separates fast reflexive decisions (System One) from slow deliberative reasoning (System Two), invoking the VLM only when necessary. Furthermore, by augmenting compact VLMs with spatial and textual cues, IROS delivers robust, human-like navigation with minimal latency. Across five real-world buildings, IROS improves decision accuracy and reduces latency by 66% compared to continuous VLM-based navigation.
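
The dual-process split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Observation` fields, the escalation rule in `system_one`, the action names, and the `stub_vlm` stand-in are all hypothetical, chosen only to show a fast path that handles routine frames and defers to a slow reasoner when a semantic decision appears to be needed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Observation:
    """Hypothetical per-frame percept; fields are illustrative, not from the paper."""
    at_intersection: bool        # a branching point was detected in this frame
    sign_text: Optional[str]     # text read from a sign or door plate, if any


def system_one(obs: Observation) -> Optional[str]:
    """Fast reflexive policy: return an action, or None to defer to System Two."""
    if obs.at_intersection or obs.sign_text is not None:
        return None              # a semantic decision is needed, so escalate
    return "forward"             # assumed default: keep following the corridor


def stub_vlm(obs: Observation) -> str:
    """Stand-in for a compact VLM call; a real system would query a model here."""
    if obs.sign_text and "left" in obs.sign_text.lower():
        return "turn_left"
    return "turn_right"


def navigate_step(
    obs: Observation,
    slow_reasoner: Callable[[Observation], str],
    stats: Dict[str, int],
) -> str:
    """Try the fast path first; invoke the slow reasoner only when it defers."""
    action = system_one(obs)
    if action is None:
        stats["vlm_calls"] += 1
        action = slow_reasoner(obs)
    stats["steps"] += 1
    return action


# Four frames: plain corridor, plain corridor, signed intersection, plain corridor.
frames = [
    Observation(False, None),
    Observation(False, None),
    Observation(True, "Rooms 101-120 left"),
    Observation(False, None),
]
stats = {"steps": 0, "vlm_calls": 0}
actions = [navigate_step(f, stub_vlm, stats) for f in frames]
print(actions, stats)
```

In this toy run the slow reasoner is called on one of four frames. The latency saving in such a design comes from how rarely the fast path defers, which is why the escalation rule is the design choice that matters most; the rule shown here is only a placeholder for whatever trigger the actual system uses.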
Problem

Research questions and friction points this paper is trying to address.

indoor navigation
vision-language models
real-time operation
semantic understanding
mobile robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Process Architecture
Vision-Language Model (VLM)
Real-Time Navigation
On-Device Inference
Semantic Indoor Navigation