CSGaze: Context-aware Social Gaze Prediction

📅 2025-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the challenge of modeling and interpreting social gaze patterns in multi-person dialogue scenarios, where existing methods underutilize contextual cues. To this end, the authors propose a speaker-centric fine-grained attention mechanism that jointly integrates facial features, visual scene information, and context-aware representations within a multimodal gaze prediction framework. The mechanism explicitly models socially driven gaze dynamics and produces interpretable attention scores. Evaluated on three benchmark datasets (GP-Static, UCO-LAEO, and AVA-LAEO), the framework performs competitively with state-of-the-art methods and generalizes well under open-set settings. The key contribution is the integration of context awareness with speaker-centric modeling, which improves both prediction accuracy and the behavioral interpretability of attention allocation in social interactions.

📝 Abstract
A person's gaze offers valuable insights into their focus of attention, level of social engagement, and confidence. In this work, we investigate how contextual cues combined with visual scene and facial information can be effectively utilized to predict and interpret social gaze patterns during conversational interactions. We introduce CSGaze, a context-aware multimodal approach that leverages facial and scene information as complementary inputs to enhance social gaze pattern prediction from multi-person images. The model also incorporates a fine-grained attention mechanism centered on the principal speaker, which helps to better model social gaze dynamics. Experimental results show that CSGaze performs competitively with state-of-the-art methods on GP-Static, UCO-LAEO, and AVA-LAEO. Our findings highlight the role of contextual cues in improving social gaze prediction. Additionally, we provide initial explainability through generated attention scores, offering insight into the model's decision-making process. We also demonstrate the model's generalizability by testing on open-set datasets, showing its robustness across diverse scenarios.
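The paper does not publish its implementation here, but the speaker-centric attention it describes can be sketched as standard scaled dot-product attention in which the principal speaker's embedding serves as the query and the facial/scene/context embeddings serve as keys and values; the resulting softmax weights are the interpretable attention scores. All names, dimensions, and inputs below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def speaker_centric_attention(speaker_feat, context_feats):
    """Scaled dot-product attention with the principal speaker's
    embedding as the query and the scene/face/context embeddings as
    keys and values. Returns the fused feature plus the attention
    weights, which can be inspected for explainability."""
    d = speaker_feat.shape[-1]
    # similarity of the speaker query to each contextual key
    scores = context_feats @ speaker_feat / np.sqrt(d)   # shape (n,)
    scores -= scores.max()                               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over context
    fused = weights @ context_feats                      # weighted sum, shape (d,)
    return fused, weights

rng = np.random.default_rng(0)
speaker = rng.standard_normal(64)        # hypothetical speaker face embedding
context = rng.standard_normal((5, 64))   # 5 hypothetical face/scene/context tokens
fused, w = speaker_centric_attention(speaker, context)
```

The weight vector `w` sums to one, so each entry can be read directly as the share of attention the model assigns to one contextual element, which is the kind of score the paper uses for its initial explainability analysis.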
Problem

Research questions and friction points this paper is trying to address.

Predicting social gaze patterns using contextual and visual cues
Enhancing gaze prediction with multimodal facial and scene information
Improving model interpretability via explainable attention scores
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages facial and scene information for gaze prediction
Uses a fine-grained attention mechanism centered on the principal speaker
Generates explainable attention scores for model decisions