CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

📅 2026-02-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses a critical limitation in current vision-language models (VLMs) for indoor navigation: their frequent disregard for the physical and operational constraints of embodied agents, which often leads to unrealistic or infeasible paths. To bridge this gap, we introduce CapNav, the first vision-language navigation benchmark that explicitly incorporates agent capability constraints. CapNav defines five types of embodied agents with distinct capability profiles and includes 473 navigation tasks and 2,365 question-answer pairs across 45 real-world indoor scenes. Integrating vision-language modeling, embodied reasoning, 3D scene understanding, and spatial inference, CapNav enables systematic evaluation of VLM performance under realistic capability constraints. Experiments on 13 state-of-the-art VLMs reveal significant performance degradation under strict capability limits, particularly for obstacles that require reasoning about spatial dimensions, underscoring both the necessity and the challenge of advancing embodied spatial reasoning in navigation.

πŸ“ Abstract
Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints: for example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2,365 QA pairs to test whether VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning about spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav
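To make the abstract's notion of a capability profile concrete, here is a minimal sketch of how an agent's constraints could gate traversability decisions. All class names, fields, and the feasibility rules below are illustrative assumptions for this summary, not the benchmark's actual schema or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    """Hypothetical capability profile: physical dimensions plus mobility
    and interaction abilities, in the spirit of CapNav's five agent types."""
    name: str
    width_cm: float
    can_climb_stairs: bool
    can_open_doors: bool

@dataclass
class Obstacle:
    """Simplified obstacle with the spatial attribute a model must reason about."""
    kind: str            # e.g. "stairs", "doorway", "gap"
    clearance_cm: float  # traversable width at this obstacle

def is_traversable(agent: AgentProfile, obstacle: Obstacle) -> bool:
    """Toy feasibility check (assumed logic): an obstacle blocks an agent
    unless its capabilities and dimensions permit passage."""
    if obstacle.kind == "stairs":
        return agent.can_climb_stairs
    if obstacle.kind == "doorway":
        return agent.can_open_doors and agent.width_cm <= obstacle.clearance_cm
    # Default: pass only if the agent physically fits through the clearance.
    return agent.width_cm <= obstacle.clearance_cm

# The abstract's example: stairs stop a sweeping robot but not a quadruped.
sweeper = AgentProfile("sweeping robot", width_cm=34,
                       can_climb_stairs=False, can_open_doors=False)
quadruped = AgentProfile("quadruped", width_cm=45,
                         can_climb_stairs=True, can_open_doors=False)
stairs = Obstacle("stairs", clearance_cm=90)

print(is_traversable(sweeper, stairs))    # False
print(is_traversable(quadruped, stairs))  # True
```

Even this toy check shows why the benchmark's hard cases involve spatial dimensions: the stairs rule is a boolean capability lookup, while the doorway rule forces a numeric comparison between the agent's body and the environment.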
Problem

Research questions and friction points this paper aims to address.

Vision-Language Navigation
Capability-Conditioned Navigation
Embodied Spatial Reasoning
Mobility Constraints
Indoor Navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Capability-Conditioned Navigation
Vision-Language Models
Embodied Spatial Reasoning
Indoor Navigation Benchmark
Agent Mobility Constraints