Egocentric Bias in Vision-Language Models

πŸ“… 2026-02-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses a systematic egocentric bias in vision-language models (VLMs) on Level-2 visual perspective-taking (L2 VPT) tasks, revealing their inability to integrate theory of mind with spatial reasoning to simulate others’ viewpoints. The authors propose FlipSet, a diagnostic benchmark that isolates social cognition from spatial manipulation through controlled 180-degree rotations of 2D characters, thereby eliminating confounding factors from 3D scenes. Large-scale evaluation across 103 VLMs demonstrates that the vast majority perform below chance, with approximately 75% of errors directly replicating the model’s own visual perspective. This study provides the first evidence that while VLMs can independently handle mental rotation and theory-of-mind tasks, they lack the compositional mechanism necessary to effectively combine these capabilities, exposing a fundamental deficit in compositional socio-spatial reasoning.

πŸ“ Abstract
Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
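For intuition, the FlipSet task of "simulating 180-degree rotations of 2D character strings" can be sketched as reversing the sequence and flipping each glyph in the image plane. The glyph mapping below is purely illustrative; the paper's actual stimulus set and character inventory are not specified here:

```python
# Minimal sketch of a 180-degree in-plane rotation of a character string.
# The glyph mapping is an assumption for illustration only; FlipSet's
# actual stimuli may use a different character set.
ROT180 = {
    "b": "q", "q": "b", "d": "p", "p": "d",
    "u": "n", "n": "u", "6": "9", "9": "6",
    "o": "o", "x": "x", "s": "s", "z": "z",
    "0": "0", "8": "8",
}

def rotate_180(s: str) -> str:
    """Return s as it would appear to a viewer facing the camera,
    i.e. after a 180-degree rotation in the image plane: reverse the
    character order, then flip each glyph upside down."""
    return "".join(ROT180[c] for c in reversed(s))

print(rotate_180("bud"))  # -> "pnq"
```

An egocentric error, in these terms, would be a model answering with the unrotated string `"bud"` (its own camera view) instead of the other agent's view `"pnq"`.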
Problem

Research questions and friction points this paper is trying to address.

egocentric bias
visual perspective taking
vision-language models
spatial reasoning
theory of mind
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric bias
visual perspective taking
FlipSet
vision-language models
spatial reasoning
Maijunxian Wang
University of California, Berkeley
Social Justice, Artificial General Intelligence, AI Alignment, AI Ethics, Machine Cognition
Yijiang Li
Argonne National Laboratory
Bingyang Wang
School of Computer Science, Georgia Institute of Technology & Emory University
Tianwei Zhao
Department of Computer Science, Johns Hopkins University
Ran Ji
Department of Cognitive Science, University of California San Diego
Qingying Gao
Department of Computer Science & Wilmer Eye Institute, Johns Hopkins University
Emmy Liu
PhD Student, Carnegie Mellon University
Hokin Deng
Johns Hopkins University
cognition
Dezhi Luo
University of Michigan
cognitive science, philosophy, AI