Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

📅 2025-06-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether current vision-language models (VLMs) possess a foundational theory-of-mind ability, namely inferring human gaze direction, or instead rely on superficial heuristics. Method: 111 VLMs are systematically evaluated on controlled image stimuli with multi-level difficulty scaling, analyzed with mixed-effects statistical models, and rigorously compared against human behavioral data. Contribution/Results: 94 of 111 models perform no better than chance; the top five show only marginal gains that degrade with increasing task difficulty yet remain stable under prompt and object perturbations, indicating reliance on non-semantic, surface-level cues. We introduce the first "heuristic + random" hybrid behavioral model to formalize this non-random yet non-semantic failure mode in gaze reasoning. Our findings reveal a pervasive absence of foundational theory-of-mind capabilities across modern VLMs, establishing a critical benchmark and a theoretical caution for embodied AI and human-machine interaction applications that require reliable gaze understanding.

📝 Abstract
Gaze-referential inference, the ability to infer what others are looking at, is a critical component of theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photographs with systematically manipulated difficulty and variability, compared their performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy; the failing VLMs even selected each answer option at nearly equal rates. Are they guessing at random? Although most VLMs struggle, when we zoom in on the five top-tier VLMs with above-chance performance, we find that their accuracy declines with increasing task difficulty but varies only slightly across prompts and scene objects. These behavioral signatures cannot be explained by treating them as random guessers. Instead, they likely combine heuristics with guessing, so that their performance is sensitive to task difficulty but robust to perceptual variations. This suggests that VLMs, still lacking gaze-inference capability, have yet to become technologies that can interact naturally with humans, but the potential remains.
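The "heuristic + random" account in the abstract can be made concrete with a small simulation: on each trial the responder either applies a shallow heuristic or guesses uniformly among the answer options. The function names, mixture weight, and heuristic accuracy below are illustrative assumptions, not values reported in the paper:

```python
import random

def hybrid_accuracy(w, heuristic_acc, k):
    """Expected accuracy of a 'heuristic + random' responder on a
    k-alternative forced-choice task: with probability w it applies a
    heuristic that succeeds with probability heuristic_acc; otherwise
    it guesses uniformly among the k options."""
    return w * heuristic_acc + (1 - w) / k

def simulate(w, heuristic_acc, k, n_trials, seed=0):
    """Monte Carlo check of the closed-form expectation."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        if rng.random() < w:
            correct += rng.random() < heuristic_acc  # heuristic trial
        else:
            correct += rng.randrange(k) == 0         # uniform guess
    return correct / n_trials

# On a 4-alternative task, chance is 0.25. A weak heuristic used on
# 40% of trials (accuracy 0.6) lifts accuracy only marginally above
# chance: 0.4 * 0.6 + 0.6 * 0.25 = 0.39.
acc = hybrid_accuracy(w=0.4, heuristic_acc=0.6, k=4)
```

Under this mixture, harder items (lower `heuristic_acc`) pull accuracy toward the chance floor `1/k`, while changes that the heuristic ignores (prompts, scene objects) leave it untouched, matching the reported behavioral pattern.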
Problem

Research questions and friction points this paper is trying to address.

Evaluate VLMs' ability to infer human gaze direction
Compare VLM performance with humans in gaze inference
Analyze VLMs' heuristics and limitations in gaze tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated 111 VLMs on gaze inference
Compared VLM performance with humans
Analyzed top VLMs using mixed-effects models
👥 Authors
Zory Zhang · Brown University
Pinyuan Feng · Columbia University
Bingyang Wang · Emory University
Tianwei Zhao · Johns Hopkins University
Suyang Yu · University of Washington
Qingying Gao · Johns Hopkins University
Hokin Deng · Johns Hopkins University · cognition
Ziqiao Ma · University of Michigan · Machine Learning, Computational Linguistics
Yijiang Li · Argonne National Laboratory
Dezhi Luo · University of Michigan · cognitive science, philosophy, AI