🤖 AI Summary
Current gaze prediction models typically produce a single averaged scanpath, failing to capture inter-subject variability and task dependency in human visual exploration. To address this, we propose the first diffusion-based framework integrating Vision Transformers for scanpath generation. Our method models gaze variability via stochastic sampling in latent space and enables unified modeling of free-viewing and task-driven scenarios through text-conditioned control. This design supports controllable, diverse, and physiologically plausible scanpath synthesis. Extensive experiments on multiple benchmark datasets demonstrate significant improvements over state-of-the-art methods. Quantitative metrics and qualitative analysis jointly validate the model's superior capacity to capture key complexities of human visual behavior, including scanpath diversity, target consistency, and spatiotemporal dynamics, while preserving anatomical and cognitive plausibility.
📝 Abstract
Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, advancing gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.
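To illustrate the core idea of generating diverse scanpaths via stochastic diffusion sampling, the toy sketch below runs a DDPM-style reverse process over a sequence of fixation vectors, with a text-derived conditioning signal. This is a minimal, hypothetical illustration only: the `denoiser` and `embed_text` functions are placeholder stand-ins, not the actual ScanDiff networks (the paper uses trained Vision Transformer components).

```python
import numpy as np

T = 50                       # number of diffusion steps (illustrative choice)
SEQ_LEN, DIM = 10, 3         # 10 fixations, each (x, y, duration)

betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def embed_text(task):
    """Hypothetical text conditioning: derive a vector from the task string."""
    rng = np.random.default_rng(sum(ord(c) for c in task))
    return rng.standard_normal(DIM)

def denoiser(x_t, t, cond):
    """Placeholder for the learned noise-prediction network.
    A trained model would predict the noise added at step t,
    conditioned on the image and the task embedding."""
    return 0.1 * (x_t - cond)

def sample_scanpath(task, seed):
    """Reverse diffusion: start from Gaussian noise and denoise step by step.
    Different seeds yield different plausible scanpaths (diversity)."""
    rng = np.random.default_rng(seed)
    cond = embed_text(task)
    x = rng.standard_normal((SEQ_LEN, DIM))          # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)
        # DDPM posterior mean update
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                    # inject noise except at the last step
            x += np.sqrt(betas[t]) * rng.standard_normal((SEQ_LEN, DIM))
    return x

# Stochastic sampling yields distinct trajectories for the same task prompt.
p1 = sample_scanpath("find the cup", seed=0)
p2 = sample_scanpath("find the cup", seed=1)
print(p1.shape, np.allclose(p1, p2))
```

Because each sample starts from independent Gaussian noise, repeated calls with different seeds produce different trajectories, which is the mechanism the abstract refers to when it says the model "explicitly models scanpath variability by leveraging the stochastic nature of diffusion models."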