OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

📅 2026-05-25
🤖 AI Summary
This work addresses the disconnect between semantic reasoning and continuous spatial localization in existing gaze-following methods, as well as their computational inefficiency in multi-person scenarios. The authors propose OmniGF, a framework with a dual-branch decoding architecture that jointly handles discrete semantic inference and continuous coordinate prediction. A structured language branch generates discrete reasoning states, while a continuous spatial branch reads the vision-language model's dense hidden states and is supervised with high-resolution gaze target heatmaps for precise localization. In addition, head embeddings encoded from cropped head images are added to the input, so that appearance and orientation are modeled for all individuals at once. By moving beyond purely text-based coordinate generation, OmniGF is reported to achieve state-of-the-art performance on multiple standard benchmarks, improving both spatial localization precision and semantic understanding in complex social scenes.
📝 Abstract
Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at https://github.com/cvlab-stonybrook/omnigf.
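The contrast between heatmap supervision and text-only coordinate generation can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the Gaussian target, the 64x64 grid, the expected-coordinate readout, and the deliberately coarse 10-bin text quantization are all assumptions chosen only to make the idea concrete, and every function name here is hypothetical.

```python
import math

def gaussian_heatmap(cx, cy, size=64, sigma=3.0):
    """Build a size x size target heatmap with a Gaussian peak at normalized (cx, cy).

    Assumption: a Gaussian target is a common choice for heatmap supervision;
    the abstract does not state which target shape OmniGF uses.
    """
    px, py = cx * (size - 1), cy * (size - 1)
    return [
        [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]

def decode_expected_xy(heatmap):
    """Decode a continuous normalized (x, y) as the intensity-weighted mean position."""
    size = len(heatmap)
    total = sum(sum(row) for row in heatmap)
    x = sum(v * j for row in heatmap for j, v in enumerate(row)) / total
    y = sum(v * i for i, row in enumerate(heatmap) for v in row) / total
    return x / (size - 1), y / (size - 1)

def text_quantize(cx, cy, bins=10):
    """Mimic a coordinate emitted as discrete tokens by rounding to 1/bins steps.

    Assumption: bins=10 is intentionally coarse for illustration and does not
    reflect the tokenization of any particular VLM.
    """
    return round(cx * bins) / bins, round(cy * bins) / bins

# Hypothetical gaze target in normalized image coordinates.
target = (0.437, 0.562)
from_heatmap = decode_expected_xy(gaussian_heatmap(*target))
from_text = text_quantize(*target)

def l2(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

print("heatmap readout error:", l2(from_heatmap, target))
print("quantized text error: ", l2(from_text, target))
```

In this toy setting the heatmap readout recovers the target to sub-cell precision, while the quantized output is limited by its bin width. How large that gap is in practice depends on the actual token vocabulary and heatmap resolution, which the abstract does not specify.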
Problem

Research questions and friction points this paper is trying to address.

gaze following
vision-language models
spatial localization
semantic reasoning
multi-person scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-branch decoding
vision-language model
gaze following
spatial heatmap supervision
multi-person gaze reasoning