Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current vision-language models (VLMs) have not been systematically evaluated on their ability to understand human gaze direction and socially meaningful looking behavior. This work proposes EyeVLM, an evaluation framework that, for the first time, decomposes gaze understanding into two distinct tasks: a geometric, vision-focused task (gaze following) and a social-semantic task (social gaze prediction). The framework evaluates VLMs in a zero-shot setting, across open- and closed-source models and prompting strategies, and in a fine-tuned setting, across model and data scales. Experimental results show that existing VLMs significantly underperform specialized vision-only models in precise gaze understanding, and that a notable performance gap persists even after fine-tuning. The findings indicate that modeling human visual attention remains an open limitation of current multimodal systems.

📝 Abstract

Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.
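To make the gaze-following task concrete, the sketch below shows one way a zero-shot VLM reply could be turned into a 2D gaze point and scored against an annotated target. It is an illustration under stated assumptions, not the EyeVLM protocol: the abstract does not specify the output format or the metric, so the normalized `(x, y)` reply format, the L2 distance score, and the helper names `parse_gaze_point`, `l2_distance`, and `score_replies` are all hypothetical.

```python
import math
import re
from typing import List, Optional, Tuple

Point = Tuple[float, float]

# Assumption: the model is prompted to answer with normalized image
# coordinates "x, y", each in [0, 1]. The real prompt format may differ.
_PAIR = re.compile(r"(\d*\.?\d+)\s*,\s*(\d*\.?\d+)")


def parse_gaze_point(reply: str) -> Optional[Point]:
    """Return the first (x, y) pair found in the reply, or None if absent or out of range."""
    match = _PAIR.search(reply)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        return None
    return (x, y)


def l2_distance(pred: Point, target: Point) -> float:
    """Euclidean distance between two points in normalized image coordinates."""
    return math.hypot(pred[0] - target[0], pred[1] - target[1])


def score_replies(replies: List[str], targets: List[Point]) -> Tuple[Optional[float], int]:
    """Return (mean L2 distance over parseable replies, number of unparseable replies)."""
    distances = []
    failures = 0
    for reply, target in zip(replies, targets):
        point = parse_gaze_point(reply)
        if point is None:
            failures += 1  # Assumption: unparseable replies are counted separately, not scored.
        else:
            distances.append(l2_distance(point, target))
    mean = sum(distances) / len(distances) if distances else None
    return mean, failures
```

Reporting parse failures separately from the distance score is a design choice made here for clarity: free-form text output can fail in ways that a vision-only model with a fixed output head cannot, and folding those failures into the mean would hide that difference.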

Problem

Research questions and friction points this paper is trying to address.

gaze following

social gaze prediction

vision-language models

human attention

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

gaze following

social gaze prediction

vision-language models