Unified Dynamic Scanpath Predictors Outperform Individually Trained Neural Models

📅 2024-05-05
🏛️ arXiv.org
🤖 AI Summary
Existing scanpath prediction methods predominantly rely on population-level models, neglecting inter-individual heterogeneity in eye movements and thus limiting applicability to real-world social human–robot interaction. Method: To predict individual gaze in dynamic social videos, the authors propose a unified deep learning model that recursively integrates fixation history and social cues through a gating mechanism and sequential attention over dynamic saliency representations. The model implicitly learns both universal attention and subject-specific patterns within a single architecture. Contribution/Results: Evaluated on free-viewing gaze datasets of dynamic social scenes, a single unified model trained on all observers' scanpaths performs on par with or better than individually trained models. Experiments also show that late neural integration surpasses early fusion when models are trained on a larger dataset. Conditioning on fixation history lets one model predict diverse observers' scanpaths, avoiding the resource-intensive training of a separate model per observer.
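The summary's core mechanism can be sketched in miniature: blend a fixation-history feature vector with a social-cue feature vector through a learned element-wise gate. This is a minimal illustrative sketch, not the paper's actual architecture; the function name, the shared-gate parameterization, and the plain-Python vectors are all assumptions made for clarity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(fixation_feat, social_feat, w_gate, b_gate):
    """Blend two per-frame feature vectors with an element-wise gate.

    Illustrative only (not the authors' exact formulation):
        g_i   = sigmoid(w_i * (f_i + s_i) + b_i)
        out_i = g_i * f_i + (1 - g_i) * s_i
    A gate near 1 trusts the fixation-history feature; near 0, the
    social-cue feature dominates.
    """
    out = []
    for f, s, w, b in zip(fixation_feat, social_feat, w_gate, b_gate):
        g = sigmoid(w * (f + s) + b)
        out.append(g * f + (1.0 - g) * s)
    return out
```

In a real model the gate weights would be learned end-to-end and the vectors would be recurrent-network states; here they are fixed numbers so the blending behavior is easy to inspect.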

📝 Abstract
Previous research on scanpath prediction has mainly focused on group models, disregarding the fact that the scanpaths and attentional behaviors of individuals are diverse. The disregard of these differences is especially detrimental to social human-robot interaction, whereby robots commonly emulate human gaze based on heuristics or predefined patterns. However, human gaze patterns are heterogeneous and varying behaviors can significantly affect the outcomes of such human-robot interactions. To fill this gap, we developed a deep learning-based social cue integration model for saliency prediction to instead predict scanpaths in videos. Our model learned scanpaths by recursively integrating fixation history and social cues through a gating mechanism and sequential attention. We evaluated our approach on gaze datasets of dynamic social scenes, observed under the free-viewing condition. The introduction of fixation history into our models makes it possible to train a single unified model rather than the resource-intensive approach of training individual models for each set of scanpaths. We observed that the late neural integration approach surpasses early fusion when training models on a large dataset, in comparison to a smaller dataset with a similar distribution. Results also indicate that a single unified model, trained on all the observers' scanpaths, performs on par or better than individually trained models. We hypothesize that this outcome is a result of the group saliency representations instilling universal attention in the model, while the supervisory signal and fixation history guide it to learn personalized attentional behaviors, providing the unified model a benefit over individual models due to its implicit representation of universal attention.
Problem

Research questions and friction points this paper is trying to address.

Predict diverse individual scanpaths in videos
Improve human-robot interaction via dynamic gaze prediction
Avoid training a separate neural model per observer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep learning model integrates social cues dynamically
Unified model uses fixation history for personalization
Late neural integration outperforms early fusion
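The early-versus-late fusion distinction in the bullets above can be made concrete: early fusion combines modality streams before a single shared encoder, while late fusion encodes each stream separately and merges the resulting representations. The sketch below is a hypothetical illustration; the function names, the toy element-wise encoder, and the max-merge are assumptions, not the paper's networks.

```python
def early_fusion(streams, encode):
    # Early fusion: merge modality streams element-wise first,
    # then pass the combined map through one shared encoder.
    combined = [sum(vals) for vals in zip(*streams)]
    return encode(combined)


def late_fusion(streams, encode, merge):
    # Late fusion: encode each modality stream separately,
    # then merge the encoded representations at the end.
    encoded = [encode(s) for s in streams]
    return merge(encoded)
```

With linear toy components the two paths coincide; the interesting differences the paper reports emerge only with nonlinear learned encoders and larger training sets, where late integration preserves modality-specific structure longer.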
Fares Abawi
Department of Informatics, University of Hamburg, Vogt-Koelln-Str. 30, 22527, Hamburg, Germany
Di Fu
Department of Informatics, University of Hamburg, Vogt-Koelln-Str. 30, 22527, Hamburg, Germany
Stefan Wermter
Department of Informatics, University of Hamburg, Vogt-Koelln-Str. 30, 22527, Hamburg, Germany