🤖 AI Summary
This work addresses a key limitation of existing video gaze prediction methods, which are typically constrained to short temporal windows (usually 3–5 seconds) and therefore struggle to model long-range spatiotemporal dependencies and fine-grained gaze dynamics. To overcome this, we propose the first autoregressive diffusion-based generative framework capable of processing videos of arbitrary length. Our approach jointly generates high-resolution timestamps and continuous spatial coordinates of realistic raw gaze trajectories, guided by a saliency-aware visual latent space. By moving beyond conventional short-horizon constraints, the method captures long-term behavioral dependencies and achieves significant improvements over state-of-the-art approaches in both spatiotemporal accuracy and trajectory realism, as demonstrated through comprehensive quantitative and qualitative evaluations.
📝 Abstract
Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3–5 s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. Leveraging an autoregressive diffusion model conditioned on a saliency-aware visual latent space, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms prior methods in long-range spatiotemporal accuracy and trajectory realism.
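For intuition, the sketch below outlines how an autoregressive diffusion sampler of this kind could be organized: each iteration denoises a chunk of raw gaze points (timestamp, x, y) from Gaussian noise, conditioned on saliency-aware visual latents and the previously generated history, until the video's duration is covered. This is a minimal illustration under stated assumptions; the `encoder`/`denoiser` interfaces, the `denoise_step` signature, the chunked triple representation, and all hyperparameters are hypothetical stand-ins, not the paper's actual implementation.

```python
import torch

# Hypothetical interfaces: `encoder` maps video frames to saliency-aware visual
# latents, `denoiser` performs one reverse-diffusion step. Both are assumptions
# used only to illustrate the autoregressive sampling loop.

@torch.no_grad()
def sample_gaze_trajectory(video_frames, encoder, denoiser, chunk_len=32, n_steps=50):
    """Autoregressively sample raw gaze points (t, x, y) for a video of arbitrary length."""
    latents = encoder(video_frames)        # (T, D) saliency-aware visual latents
    history = torch.empty(0, 3)            # generated (timestamp, x, y) triples so far
    video_duration = len(video_frames) / encoder.fps  # assumed frame-rate attribute

    while history.numel() == 0 or history[-1, 0] < video_duration:
        # Start each new chunk from Gaussian noise and run the reverse diffusion chain,
        # conditioning on the visual latents and the previously generated trajectory.
        chunk = torch.randn(chunk_len, 3)
        for step in reversed(range(n_steps)):
            chunk = denoiser.denoise_step(chunk, step, cond_latents=latents, history=history)
        history = torch.cat([history, chunk], dim=0)

    return history
```

The autoregressive chunking is what removes the fixed-horizon constraint: because each chunk is conditioned on the full generated history rather than a fixed-size window, the loop can in principle continue for videos of any length.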