Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether current vision-language models (VLMs) possess a foundational theory-of-mind ability, namely inferring human gaze direction, or instead rely on superficial heuristics. Method: 111 VLMs are systematically evaluated on controlled image stimuli with multi-level difficulty scaling, analyzed with mixed-effects statistical models, and rigorously compared against human behavioral data. Contribution/Results: 94 of 111 models perform no better than chance; the top five show only marginal gains that degrade with increasing task difficulty yet remain stable under prompt and object perturbations, indicating reliance on non-semantic, surface-level cues. We introduce the first "heuristic + random" hybrid behavioral model to formalize this non-random yet non-semantic failure mode in gaze reasoning. Our findings reveal a pervasive absence of foundational theory-of-mind capabilities across modern VLMs, establishing a critical benchmark and a theoretical caution for embodied AI and human-machine interaction applications that require reliable gaze understanding.

📝 Abstract
Gaze-referential inference, the ability to infer what others are looking at, is a critical component of theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photographs with systematically manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy; the failing VLMs even selected each answer option at nearly equal rates. Are they guessing at random? Although most VLMs struggle, when we zoom in on the five top-tier VLMs with above-chance performance, we find that their accuracy declines with increasing task difficulty but varies only slightly across prompts and scene objects. These behavioral signatures cannot be explained by treating them as random guessers. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty but robust to perceptual variations. This suggests that VLMs, still lacking gaze-inference capability, have yet to become technologies that can interact naturally with humans, but the potential remains.
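The "heuristic + random" account in the abstract can be made concrete with a small simulation: on each trial the responder either applies a shallow heuristic or guesses uniformly among the answer options. The function names, mixture weight, and heuristic accuracy below are illustrative assumptions, not values reported in the paper:

```python
import random

def hybrid_accuracy(w, heuristic_acc, k):
    """Expected accuracy of a 'heuristic + random' responder on a
    k-alternative forced-choice task: with probability w it applies a
    heuristic that succeeds with probability heuristic_acc; otherwise
    it guesses uniformly among the k options."""
    return w * heuristic_acc + (1 - w) / k

def simulate(w, heuristic_acc, k, n_trials, seed=0):
    """Monte Carlo check of the closed-form expectation."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        if rng.random() < w:
            correct += rng.random() < heuristic_acc  # heuristic trial
        else:
            correct += rng.randrange(k) == 0         # uniform guess
    return correct / n_trials

# On a 4-alternative task, chance is 0.25. A weak heuristic used on
# 40% of trials (accuracy 0.6) lifts accuracy only marginally above
# chance: 0.4 * 0.6 + 0.6 * 0.25 = 0.39.
acc = hybrid_accuracy(w=0.4, heuristic_acc=0.6, k=4)
```

Under this mixture, harder items (lower `heuristic_acc`) pull accuracy toward the chance floor `1/k`, while changes that the heuristic ignores (prompts, scene objects) leave it untouched, matching the reported behavioral pattern.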
Problem

Research questions and friction points this paper is trying to address.

Evaluate VLMs' ability to infer human gaze direction
Compare VLM performance with humans in gaze inference
Analyze VLMs' heuristics and limitations in gaze tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated 111 VLMs on gaze inference
Compared VLM performance with humans
Analyzed top VLMs using mixed-effects models
👥 Authors
Zory Zhang · Brown University
Pinyuan Feng · Columbia University
Bingyang Wang · Emory University
Tianwei Zhao · Johns Hopkins University
Suyang Yu · University of Washington
Qingying Gao · Johns Hopkins University
Hokin Deng · Johns Hopkins University · cognition
Ziqiao Ma · University of Michigan · Machine Learning, Computational Linguistics
Yijiang Li · Argonne National Laboratory
Dezhi Luo · University of Michigan · cognitive science, philosophy, AI