🤖 AI Summary
Existing vision-language models (VLMs) lack systematic evaluation protocols and dedicated training benchmarks for human gaze understanding. Method: We introduce VL4Gaze, the first large-scale benchmark for joint vision-language gaze understanding, comprising 124K images and 489K question-answer pairs. We unify gaze understanding into four vision-language QA tasks: gaze object description, gaze direction description, gaze point location, and ambiguous question recognition. The dataset is constructed via automated image-gaze annotation and structured question generation, and models are evaluated along two pathways: in-context learning and supervised fine-tuning. Contribution/Results: Generic vision-language pretraining alone proves insufficient; explicit modeling of gaze semantics and spatial localization is essential. Fine-tuning mainstream VLMs on VL4Gaze yields significant and consistent improvements across all four tasks, validating the critical role of task-specific multi-task supervision for gaze understanding.
📝 Abstract
Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.