Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

📅 2025-05-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) exhibit limited spatial and viewpoint reasoning, particularly on visual perspective-taking tasks. Method: We introduce a multi-level diagnostic benchmark of 144 controlled scenes spanning seven fine-grained problem categories, the first to adapt human psychological perspective-taking paradigms for VLM evaluation. Our approach combines controlled miniature human-object spatial configurations, multi-view image generation, structured question-answering design, and a unified cross-model evaluation framework. Results: State-of-the-art models (e.g., GPT-4o) achieve >90% accuracy at the scene-understanding level but drop sharply to <40% at the viewpoint-taking level, revealing a critical gap rooted in deficient geometric representation learning and a lack of task-specific training. This work uncovers a fundamental hierarchical disconnect in VLM cognition and establishes a reproducible, empirically grounded evaluation paradigm for spatially and viewpoint-aware multimodal modeling.
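
As a concrete illustration of the "unified cross-model evaluation framework" mentioned above, here is a minimal sketch of the scoring loop. The model list comes from the paper's abstract; `query_model`, the task/question record fields, and the exact-match grading are assumptions for illustration, not the authors' actual harness.

```python
# Minimal sketch of a unified cross-model evaluation loop. `query_model`
# is a hypothetical adapter standing in for whatever provider SDK each
# model requires; record fields and scoring are assumed, not from the paper.
from typing import Callable

MODELS = [
    "gpt-4-turbo",
    "gpt-4o",
    "llama-3.2-11b-vision-instruct",
    "claude-sonnet",  # the paper evaluates several Sonnet variants
]

def evaluate(tasks, questions, query_model: Callable[[str, str, str], str]):
    """Score every model on every (scene image, diagnostic question) pair.

    tasks:     dicts with an "image_path" key (one of the 144 scenes)
    questions: dicts with "text", "level", and "gold" keys (the 7 questions)
    """
    records = []
    for model in MODELS:
        for task in tasks:
            for question in questions:
                answer = query_model(model, task["image_path"], question["text"])
                records.append({
                    "model": model,
                    "level": question["level"],
                    "correct": answer.strip().lower() == question["gold"].lower(),
                })
    return records
```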

📝 Abstract
We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations (such as the object's position relative to the minifigure and the minifigure's orientation) and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel at scene understanding, performance declines significantly on spatial reasoning and deteriorates further on perspective taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.
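
To make the task construction concrete, here is a minimal sketch of enumerating such a grid. The 9 × 8 × 2 factorization is assumed purely because it yields 144 tasks; the abstract does not state the exact factor counts, and the question IDs and their grouping into levels are placeholders.

```python
# Minimal sketch of enumerating a controlled task grid. The factor counts
# (9 object positions x 8 minifigure orientations x 2 camera views = 144)
# and the question-to-level grouping are illustrative assumptions.
from itertools import product

POSITIONS = [f"pos_{i}" for i in range(9)]      # object placement relative to the minifigure
ORIENTATIONS = [k * 45 for k in range(8)]       # minifigure facing direction, in degrees
VIEWS = ["birds_eye", "surface_level"]          # the two camera viewpoints from the abstract

QUESTIONS_BY_LEVEL = {                          # 7 diagnostic questions, 3 cognition levels
    "scene_understanding": ["Q1", "Q2", "Q3"],  # placeholder IDs, not the paper's wording
    "spatial_reasoning": ["Q4", "Q5"],
    "perspective_taking": ["Q6", "Q7"],
}

tasks = [
    {"position": p, "orientation": o, "view": v}
    for p, o, v in product(POSITIONS, ORIENTATIONS, VIEWS)
]
assert len(tasks) == 144                        # matches the benchmark's reported size
assert sum(len(q) for q in QUESTIONS_BY_LEVEL.values()) == 7
```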
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' ability to perform visual perspective taking
Characterizing how VLM performance degrades on spatial reasoning tasks
Identifying gaps in VLMs' geometric and perspective reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Controlled scenes with humanoid minifigure and object
144 tasks with spatial and view variations
Diagnostic questions targeting three levels of visual cognition (see the scoring sketch below)
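
A minimal sketch of how per-level accuracy could be aggregated from such diagnostics, over records shaped like those produced by the evaluation loop sketched earlier. The toy numbers only mimic the qualitative pattern reported above; they are not the paper's results.

```python
# Minimal sketch of per-level accuracy aggregation over evaluation records
# (dicts with "level" and "correct" keys). Toy data is illustrative only.
from collections import defaultdict

LEVELS = ("scene_understanding", "spatial_reasoning", "perspective_taking")

def accuracy_by_level(records):
    """Return the fraction of correct answers per cognition level."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["level"]] += 1
        hits[r["level"]] += int(r["correct"])
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}

# Toy records mimicking the reported hierarchy: strong scene understanding,
# weaker spatial reasoning, sharply degraded perspective taking.
toy = (
    [{"level": "scene_understanding", "correct": i < 92} for i in range(100)]
    + [{"level": "spatial_reasoning", "correct": i < 60} for i in range(100)]
    + [{"level": "perspective_taking", "correct": i < 35} for i in range(100)]
)
print(accuracy_by_level(toy))
# -> {'scene_understanding': 0.92, 'spatial_reasoning': 0.6, 'perspective_taking': 0.35}
```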
👥 Authors
Gracjan Goral
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw
Alicja Ziarko
University of Warsaw, IDEAS NCBR, Institute of Mathematics of the Polish Academy of Sciences
Piotr Milos
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw
Michal Nauman
UC Berkeley, University of Warsaw
machine learning, reinforcement learning
Maciej Wolczyk
IDEAS NCBR
Michal Kosinski
Stanford University
Psychology of Artificial Intelligence, Personality, Psychometrics