CoV: Chain-of-View Prompting for Spatial Reasoning

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing vision-language models in 3D embodied question answering: fixed viewpoints hinder access to dispersed and occluded contextual information, which impedes complex spatial reasoning. The authors propose a training-free, test-time reasoning framework that dynamically selects and refines observation viewpoints through coarse-to-fine active exploration to gather comprehensive spatial context. The approach combines chain-based viewpoint exploration with question-aligned anchor-view selection, yielding a model-agnostic, open-ended viewpoint search strategy. A view-selection agent filters redundant frames, while iterative reasoning interleaved with discrete camera actions efficiently acquires new observations. Evaluated on OpenEQA, the method improves LLM-Match by 11.56% on average (up to 13.62%) and also delivers significant gains on ScanQA and SQA3D, showcasing strong test-time scalability.
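To make the coarse stage concrete, below is a minimal Python sketch of question-aligned view selection as the summary describes it. The `vlm.generate()` wrapper, the prompt wording, and the comma-separated index reply are illustrative assumptions, not the paper's actual agent interface.

```python
def select_anchor_views(vlm, question: str, frames: list, k: int = 4) -> list:
    """Ask the VLM to keep the k frames most relevant to the question,
    filtering out redundant or question-irrelevant viewpoints."""
    prompt = (
        f"Question: {question}\n"
        f"You are given {len(frames)} candidate views of a 3D scene. "
        f"Reply with the comma-separated indices of the {k} views "
        f"most useful for answering the question."
    )
    reply = vlm.generate(prompt, images=frames)  # hypothetical VLM wrapper
    indices = [int(tok) for tok in reply.split(",") if tok.strip().isdigit()]
    return [frames[i] for i in indices[:k] if i < len(frames)]
```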

📝 Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training. Code is available at https://github.com/ziplab/CoV.
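For readers who want the control flow spelled out, the sketch below illustrates the fine-grained adjustment loop the abstract describes: VLM reasoning interleaved with discrete camera actions under a step budget. All names here (`vlm`, `scene.render`, `scene.apply_action`, the action list, the prompt format) are hypothetical stand-ins, not the released CoV code.

```python
# Discrete camera actions; the actual action set in the paper may differ.
ACTIONS = ["move_forward", "move_backward", "turn_left",
           "turn_right", "look_up", "look_down"]

def parse_action(reply: str) -> str:
    """Pick the first discrete action mentioned in the model's reply."""
    return next((a for a in ACTIONS if a in reply), ACTIONS[0])

def chain_of_view(vlm, scene, question: str, anchor_poses: list,
                  max_steps: int = 8) -> str:
    """Interleave reasoning with discrete camera actions until the model
    judges its context sufficient or the step budget is exhausted."""
    views = [scene.render(p) for p in anchor_poses]  # question-aligned anchors
    pose = anchor_poses[-1]
    for _ in range(max_steps):
        reply = vlm.generate(
            f"Question: {question}\n"
            f"Reply 'ANSWER: <text>' if the current views suffice, or name "
            f"one action from {ACTIONS} to obtain a new observation.",
            images=views,
        )
        if reply.startswith("ANSWER:"):              # sufficient context
            return reply[len("ANSWER:"):].strip()
        pose = scene.apply_action(pose, parse_action(reply))  # hypothetical API
        views.append(scene.render(pose))             # new observation
    # Step budget reached: force an answer from the gathered views.
    return vlm.generate(f"Question: {question}\nAnswer now.", images=views)
```

The `max_steps` budget in this sketch loosely corresponds to the action budget whose increase drives the test-time scaling gains reported in the abstract.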
Problem

Research questions and friction points this paper is trying to address.

Embodied Question Answering, 3D environments, spatial reasoning, vision-language models, viewpoint selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-View prompting, spatial reasoning, embodied question answering, active viewpoint exploration, vision-language models

👥 Authors

Haoyu Zhao
ZIP Lab, Zhejiang University

Akide Liu
PhD Student @ Monash University
Efficient AI, Computer Vision

Zeyu Zhang
Monash University

Weijie Wang
PhD Student, Zhejiang University
Computer Vision, Efficient AI, Deep Learning

Feng Chen
AIML, Adelaide University

Ruihan Zhu
ZIP Lab, Zhejiang University

Gholamreza Haffari
Monash University

Bohan Zhuang
Zhejiang University
Efficient AI, MLSys