🤖 AI Summary
Current vision-language models (VLMs) exhibit severe limitations in perspective-aware reasoning: they show a strong egocentric bias and struggle to reason from viewpoints other than the camera's, which hinders collaboration with autonomous agents. To address this, the authors propose Abstract Perspective Change (APC), a framework that brings a mental-imagery mechanism into VLM reasoning. APC leverages vision foundation models for object detection, instance segmentation, and orientation estimation to construct a structured scene abstraction, over which perspective transformations can be applied so that questions are answered from the target viewpoint rather than the egocentric one. Evaluated on both synthetic and real-image benchmarks, APC achieves an average accuracy improvement of 23.6% on viewpoint judgment and relative positional reasoning tasks, significantly outperforming state-of-the-art VLMs as well as fine-tuned spatial-reasoning models.
📝 Abstract
We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks demonstrate that our framework yields significant improvements in perspective-aware reasoning over various VLMs, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
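To make the core idea concrete, the sketch below illustrates the kind of perspective transformation a scene abstraction enables: once objects (and the alternative viewer) are represented as positions and orientations in a shared world frame, a question like "is the object to the person's left or right?" reduces to a coordinate change into that viewer's egocentric frame. This is only an illustrative 2D sketch, not the paper's implementation; the function names (`egocentric_coords`, `spatial_relation`) and the simplified pose representation are assumptions for exposition.

```python
import math

def egocentric_coords(point, viewer_pos, viewer_yaw):
    """Re-express a world-frame 2D point in a viewer's egocentric frame.

    viewer_yaw is the world-frame heading of the viewer's forward axis
    (radians, counter-clockwise from +x). Returns (forward, right):
    forward > 0 means "in front of the viewer", right > 0 means
    "to the viewer's right".
    """
    dx = point[0] - viewer_pos[0]
    dy = point[1] - viewer_pos[1]
    fx, fy = math.cos(viewer_yaw), math.sin(viewer_yaw)   # forward axis
    rx, ry = math.sin(viewer_yaw), -math.cos(viewer_yaw)  # right axis
    return dx * fx + dy * fy, dx * rx + dy * ry

def spatial_relation(point, viewer_pos, viewer_yaw):
    """Coarse front/behind, left/right judgment from the viewer's perspective."""
    forward, right = egocentric_coords(point, viewer_pos, viewer_yaw)
    return ("front" if forward >= 0 else "behind",
            "right" if right >= 0 else "left")

# The same object flips sides when the reference viewpoint changes:
obj = (1.0, 1.0)
print(spatial_relation(obj, (0.0, 0.0), math.pi / 2))   # camera facing +y -> ('front', 'right')
print(spatial_relation(obj, (0.0, 2.0), -math.pi / 2))  # person facing the camera -> ('front', 'left')
```

The example highlights why egocentric bias is a failure mode: the camera and the facing person disagree on left versus right for the very same object, so answering from the abstraction in the target viewer's frame, rather than the image frame, is what the perspective change buys.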