VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing language-driven navigation methods struggle to balance generalization, interpretability, and computational efficiency. This work proposes an end-to-end vision-language-action (VLA) agent with 3 billion parameters that departs from conventional multi-model ensembles and embedding-matching paradigms, achieving human-interpretable embodied reasoning within a compact architecture. The approach unifies object recognition and navigation decision-making through a three-stage, explicitly image-grounded reasoning process ("think", "think summary", "action") that anchors each decision in the current visual observation. Experimental results indicate that the model improves both generalization and decision interpretability while maintaining efficient inference, avoiding the error propagation and high computational overhead inherent in modular pipelines.
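The three-stage output format described above can be pictured as a tagged model response that is parsed into its stages before the action is executed. The sketch below is a hypothetical illustration; the tag names, `ReasoningStep` structure, and `parse_vla_output` helper are assumptions for clarity, not the paper's actual interface.

```python
# Hypothetical sketch of the "think" -> "think summary" -> "action" output
# format. Tag names and helper functions are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    think: str          # free-form image-grounded reasoning
    think_summary: str  # condensed rationale for the decision
    action: str         # discrete navigation action to execute


def parse_vla_output(text: str) -> ReasoningStep:
    """Split a tagged model response into the three reasoning stages."""
    def between(tag: str) -> str:
        start = text.index(f"<{tag}>") + len(tag) + 2
        end = text.index(f"</{tag}>")
        return text[start:end].strip()

    return ReasoningStep(
        think=between("think"),
        think_summary=between("think_summary"),
        action=between("action"),
    )


raw = (
    "<think>The red mug on the counter matches the description.</think>"
    "<think_summary>Target likely ahead to the left.</think_summary>"
    "<action>turn_left</action>"
)
step = parse_vla_output(raw)
print(step.action)  # -> turn_left
```

Keeping the intermediate "think" and "think summary" text alongside the chosen action is what makes each navigation step inspectable, which is the explainability benefit the summary highlights.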

📝 Abstract
Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?" The reasoning process unfolds in three stages: "think", "think summary", and "action", yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset available upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

language-driven object navigation
vision-language reasoning
embodied AI
zero-shot generalization
explainable navigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
image-grounded reasoning
embodied reasoning
language-driven navigation
explainable AI