VL4Gaze: Unleashing Vision-Language Models for Gaze Following

📅 2025-12-23
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing vision-language models (VLMs) lack systematic evaluation protocols and dedicated training benchmarks for human gaze understanding. Method: We introduce VL4Gaze, the first large-scale vision-language benchmark for joint gaze understanding, comprising 124K images and 489K question-answer pairs. We unify gaze understanding into four vision-language QA tasks: gaze object description, gaze direction description, gaze point location, and ambiguous question recognition. The dataset is constructed via automated image-gaze annotation and structured question generation, and models are assessed under two settings: in-context learning and supervised fine-tuning. Contribution/Results: Fine-tuning mainstream VLMs on VL4Gaze yields significant and consistent improvements across all four tasks, showing that generic pretraining alone is insufficient and that explicit multi-task supervision of gaze semantics and spatial localization is essential.
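The released data format is not shown here; as a rough sketch, each of the four tasks might map to a question-answer record like the ones below. Field names, answer phrasing, and the normalized-coordinate convention are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical sketch of one VL4Gaze-style record per task.
# All field names and answer formats are guesses for illustration.
records = [
    {
        "image": "scene_000123.jpg",
        "task": "gaze_object_description",        # task 1
        "question": "What is the person in the red coat looking at?",
        "answer": "the laptop screen on the desk",
    },
    {
        "image": "scene_000123.jpg",
        "task": "gaze_direction_description",     # task 2
        "question": "In which direction is the person in the red coat looking?",
        "answer": "down and to the left",
    },
    {
        "image": "scene_000123.jpg",
        "task": "gaze_point_location",            # task 3
        "question": "Where is the gaze point of the person in the red coat?",
        "answer": "(0.42, 0.67)",                 # assumed normalized (x, y)
    },
    {
        "image": "scene_000456.jpg",
        "task": "ambiguous_question_recognition", # task 4
        "question": "What is the person in the red coat looking at?",
        "answer": "ambiguous: no person in a red coat is present",
    },
]
```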

📝 Abstract
Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.
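The abstract contrasts in-context learning with fine-tuning as the two evaluation settings. Below is a minimal sketch of the in-context pathway, written against a hypothetical `vlm.generate(images, prompt)` interface; the prompt wording, the two-shot format, and the coordinate answers are all assumptions, not the paper's protocol.

```python
# Minimal sketch of the in-context learning setting: prepend k solved
# examples to the query, with no weight updates. The `vlm` object and
# its generate() signature are hypothetical.

FEW_SHOT = [
    ("examples/ex1.jpg",
     "Where is the gaze point of the person on the left?", "(0.31, 0.55)"),
    ("examples/ex2.jpg",
     "Where is the gaze point of the child?", "(0.78, 0.40)"),
]

def in_context_query(vlm, image, question, shots=FEW_SHOT):
    """Build a k-shot multimodal prompt and query the model once."""
    images, parts = [], []
    for shot_img, q, a in shots:
        images.append(shot_img)
        parts.append(f"Q: {q}\nA: {a}")
    images.append(image)
    parts.append(f"Q: {question}\nA:")
    return vlm.generate(images=images, prompt="\n\n".join(parts))
```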
Problem

Research questions and friction points this paper is trying to address.

Evaluates vision-language models' gaze understanding capabilities
Creates a benchmark that frames gaze interpretation as VQA tasks
Investigates the need for targeted supervision in gaze analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the VL4Gaze benchmark (124K images, 489K QA pairs) for gaze understanding
Formulates gaze understanding as a unified VQA problem with four complementary tasks
Uses multi-task supervision to improve VLM performance (a mixing sketch follows this list)
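A minimal sketch of what multi-task supervision could look like in practice: fine-tuning batches drawn uniformly across the four task types so no single task dominates a gradient step. The task keys, the `pools` structure, and the uniform mixing ratio are assumptions, not the paper's recipe.

```python
import random

# The four VL4Gaze task types (names assumed from the abstract).
TASKS = [
    "gaze_object_description",
    "gaze_direction_description",
    "gaze_point_location",
    "ambiguous_question_recognition",
]

def multitask_batches(pools, batch_size=16, seed=0):
    """Yield batches sampled uniformly across tasks.

    `pools` maps task name -> list of QA records; each draw first picks
    a task, then a record from that task's pool, so every batch mixes
    all four supervision signals.
    """
    rng = random.Random(seed)
    while True:
        yield [rng.choice(pools[rng.choice(TASKS)])
               for _ in range(batch_size)]

# Usage (with hypothetical per-task record lists):
#   batches = multitask_batches({t: load_records(t) for t in TASKS})
#   batch = next(batches)
```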