CoV: Chain-of-View Prompting for Spatial Reasoning

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a limitation of existing vision-language models in 3D embodied question answering: fixed viewpoints hinder access to dispersed and occluded contextual information, which impedes complex spatial reasoning. The authors propose a training-free, test-time reasoning framework that dynamically selects and refines observation viewpoints through coarse-to-fine active exploration to gather comprehensive spatial context. The approach combines chain-based viewpoint exploration with question-aligned anchor-view selection, yielding a model-agnostic, open-ended viewpoint search strategy. A view-selection agent filters redundant frames, while iterative reasoning interleaved with discrete camera actions efficiently acquires new observations. Evaluated on OpenEQA, the method improves LLM-Match by 11.56% on average (up to 13.62%) and also delivers significant gains on ScanQA and SQA3D, showcasing strong test-time scalability.
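To make the coarse stage concrete, below is a minimal Python sketch of question-aligned view selection as the summary describes it. The `vlm.generate()` wrapper, the prompt wording, and the comma-separated index reply are illustrative assumptions, not the paper's actual agent interface.

```python
def select_anchor_views(vlm, question: str, frames: list, k: int = 4) -> list:
    """Ask the VLM to keep the k frames most relevant to the question,
    filtering out redundant or question-irrelevant viewpoints."""
    prompt = (
        f"Question: {question}\n"
        f"You are given {len(frames)} candidate views of a 3D scene. "
        f"Reply with the comma-separated indices of the {k} views "
        f"most useful for answering the question."
    )
    reply = vlm.generate(prompt, images=frames)  # hypothetical VLM wrapper
    indices = [int(tok) for tok in reply.split(",") if tok.strip().isdigit()]
    return [frames[i] for i in indices[:k] if i < len(frames)]
```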

📝 Abstract
Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training. Code is available at https://github.com/ziplab/CoV.
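For readers who want the control flow spelled out, the sketch below illustrates the fine-grained adjustment loop the abstract describes: VLM reasoning interleaved with discrete camera actions under a step budget. All names here (`vlm`, `scene.render`, `scene.apply_action`, the action list, the prompt format) are hypothetical stand-ins, not the released CoV code.

```python
# Discrete camera actions; the actual action set in the paper may differ.
ACTIONS = ["move_forward", "move_backward", "turn_left",
           "turn_right", "look_up", "look_down"]

def parse_action(reply: str) -> str:
    """Pick the first discrete action mentioned in the model's reply."""
    return next((a for a in ACTIONS if a in reply), ACTIONS[0])

def chain_of_view(vlm, scene, question: str, anchor_poses: list,
                  max_steps: int = 8) -> str:
    """Interleave reasoning with discrete camera actions until the model
    judges its context sufficient or the step budget is exhausted."""
    views = [scene.render(p) for p in anchor_poses]  # question-aligned anchors
    pose = anchor_poses[-1]
    for _ in range(max_steps):
        reply = vlm.generate(
            f"Question: {question}\n"
            f"Reply 'ANSWER: <text>' if the current views suffice, or name "
            f"one action from {ACTIONS} to obtain a new observation.",
            images=views,
        )
        if reply.startswith("ANSWER:"):              # sufficient context
            return reply[len("ANSWER:"):].strip()
        pose = scene.apply_action(pose, parse_action(reply))  # hypothetical API
        views.append(scene.render(pose))             # new observation
    # Step budget reached: force an answer from the gathered views.
    return vlm.generate(f"Question: {question}\nAnswer now.", images=views)
```

The `max_steps` budget in this sketch loosely corresponds to the action budget whose increase drives the test-time scaling gains reported in the abstract.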
Problem

Research questions and friction points this paper is trying to address.

Embodied Question Answering, 3D environments, spatial reasoning, vision-language models, viewpoint selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-View prompting, spatial reasoning, embodied question answering, active viewpoint exploration, vision-language models

👥 Authors

Haoyu Zhao
ZIP Lab, Zhejiang University

Akide Liu
PhD Student @ Monash University
Efficient AI, Computer Vision

Zeyu Zhang
Monash University

Weijie Wang
PhD Student, Zhejiang University
Computer Vision, Efficient AI, Deep Learning

Feng Chen
AIML, Adelaide University

Ruihan Zhu
ZIP Lab, Zhejiang University

Gholamreza Haffari
Monash University

Bohan Zhuang
Zhejiang University
Efficient AI, MLSys