🤖 AI Summary
Current vision-language models (VLMs) struggle with geometric consistency and cross-view alignment in multi-view spatial reasoning, and existing benchmarks lack cognitively fine-grained evaluation of these abilities. Method: We introduce ReMindView-Bench, the first cognitive science–driven benchmark explicitly designed to assess the construction and alignment of multi-view spatial mental models. It systematically diagnoses information degradation across the perception, integration, and reasoning stages via explicit stepwise analysis and implicit state probing. Our evaluation framework integrates LLM-as-a-judge, self-consistency prompting, linear probing, and dynamic entropy analysis to characterize how uncertainty evolves during reasoning. Contribution/Results: Evaluating 15 state-of-the-art VLMs reveals robust intra-frame perception but substantial degradation in cross-view integration; task-relevant information consistently decays with reasoning depth. ReMindView-Bench establishes a novel, interpretable paradigm for assessing VLMs' spatial reasoning capabilities.
📝 Abstract
Spatial reasoning is a core aspect of human intelligence that enables perception, inference, and planning in 3D environments. However, current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We attribute this gap to the lack of fine-grained benchmarks that isolate multi-view reasoning from single-view perception and temporal factors. To address this, we present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align, and maintain spatial mental models across complementary viewpoints. ReMindView-Bench systematically varies viewpoint, spatial pattern, and query type to probe key factors of spatial cognition. Evaluations of 15 current VLMs reveal consistent failures in cross-view alignment and perspective-taking in multi-view spatial reasoning, motivating deeper analysis of the reasoning process. Explicit phase-wise analysis using LLM-as-a-judge and self-consistency prompting shows that VLMs perform well on in-frame perception but degrade sharply when integrating information across views. Implicit analysis, including linear probing and entropy dynamics, further shows a progressive loss of task-relevant information and a separation in uncertainty between correct and incorrect trajectories. These results provide a cognitively grounded diagnosis of VLM spatial reasoning and reveal how multi-view spatial mental models are formed, degraded, and destabilized across reasoning phases. The ReMindView-Bench benchmark is available at https://huggingface.co/datasets/Xue0823/ReMindView-Bench, and the source code for benchmark construction and VLM reasoning analysis is available at https://github.com/pittisl/ReMindView-Bench.