EyEar: Learning Audio Synchronized Human Gaze Trajectory Based on Physics-Informed Dynamics

📅 2025-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces the novel task of audio-synchronized human gaze trajectory prediction and presents a multimodal audiovisual eye-tracking dataset comprising 20,000 gaze points from 8 subjects. To address the challenge of modeling dynamic gaze under joint audiovisual guidance, we propose EyEar, a physics-informed framework that integrates three priors: inherent oculomotor dynamics, visual saliency attraction, and audio-semantic attraction. A probability density scoring mechanism is introduced to mitigate inter-subject variability, stabilizing optimization and making evaluation more reliable. EyEar learns gaze motion dynamics end to end by unifying physics-based modeling, cross-modal feature alignment, density estimation, and trajectory generation. Extensive experiments demonstrate improvements over all state-of-the-art baselines across all evaluation metrics, validating the role of physics-guided priors and audiovisual coupling in gaze prediction.
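The summary describes gaze motion as the interplay of three additive influences. Below is a minimal, illustrative sketch of such a physics-informed update; the step function, weight names, and attractor inputs are assumptions made for exposition, not the authors' actual EyEar implementation.

```python
# Minimal sketch of a physics-informed gaze update combining three influences:
# inherent oculomotor momentum, visual-saliency attraction, and audio-semantic
# attraction. All names and weights here are illustrative assumptions.
import numpy as np

def gaze_step(g_t, v_t, salient_point, audio_point, w):
    """Advance the gaze point one step under three additive influences.
    g_t: current gaze position, v_t: current gaze velocity,
    w: dict of (assumed learnable) mixing weights."""
    inertia = w["inertia"] * v_t                    # eye's inherent motion tendency
    vis_pull = w["vision"] * (salient_point - g_t)  # pull toward the salient region
    aud_pull = w["audio"] * (audio_point - g_t)     # pull toward the audio-referred region
    v_next = inertia + vis_pull + aud_pull          # combined "force" on the gaze
    return g_t + v_next, v_next

# Toy usage: gaze drifts from the center toward a salient point at (0.8, 0.2)
# while the audio refers to an object at (0.3, 0.9).
g, v = np.array([0.5, 0.5]), np.zeros(2)
weights = {"inertia": 0.6, "vision": 0.25, "audio": 0.15}
for _ in range(5):
    g, v = gaze_step(g, v, np.array([0.8, 0.2]), np.array([0.3, 0.9]), weights)
print(g)
```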

📝 Abstract
Imitating how humans move their gaze in a visual scene is a vital research problem for both visual understanding and psychology, enabling crucial applications such as building lifelike virtual characters. Previous studies aim to predict gaze trajectories when humans are free-viewing an image, searching for required targets, or looking for clues to answer questions about an image. While these tasks focus on visual-centric scenarios, in more common scenarios humans also move their gaze in response to audio inputs. To fill this gap, we introduce a new task that predicts human gaze trajectories in a visual scene with synchronized audio inputs and provide a new dataset containing 20k gaze points from 8 subjects. To effectively integrate audio information and simulate the dynamic process of human gaze motion, we propose a novel learning framework called EyEar (Eye moving while Ear listening) based on physics-informed dynamics, which considers three key factors to predict gaze: the eye's inherent motion tendency, visual salient attraction, and audio semantic attraction. We also propose a probability density score to overcome the high individual variability of gaze trajectories, thereby improving the stability of optimization and the reliability of evaluation. Experimental results show that EyEar outperforms all baselines on all evaluation metrics, thanks to the proposed components of the learning model.
Problem

Research questions and friction points this paper is trying to address.

Predict human gaze trajectories with synchronized audio inputs.
Integrate audio information to simulate human gaze dynamics.
Overcome individual variability in gaze trajectory prediction.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Physics-informed dynamics for gaze prediction
Integration of audio and visual data
Probability density score to handle inter-subject variability, stabilizing optimization and evaluation (sketched below)
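To make the density-based evaluation idea concrete, here is a hedged sketch that scores a predicted trajectory by its log-likelihood under per-time-step kernel density estimates fitted to multiple subjects' gaze points. The function name, the choice of KDE, and the bandwidth are assumptions, not the paper's exact formulation.

```python
# Illustrative probability-density-style score: rather than matching one
# subject's trajectory, score each predicted gaze point by its likelihood under
# a density fitted to all subjects' gaze points at the same time step.
import numpy as np
from scipy.stats import gaussian_kde

def density_score(predicted_traj, subject_trajs, bandwidth=0.1):
    """predicted_traj: (T, 2) predicted gaze points.
    subject_trajs: (S, T, 2) gaze points from S subjects.
    Returns the mean log-density of the predictions under per-step KDEs."""
    scores = []
    for t in range(predicted_traj.shape[0]):
        kde = gaussian_kde(subject_trajs[:, t, :].T, bw_method=bandwidth)
        scores.append(np.log(kde(predicted_traj[t])[0] + 1e-12))
    return float(np.mean(scores))

# Toy usage with 8 simulated subjects and 10 time steps: a prediction near
# the inter-subject consensus receives a high score.
rng = np.random.default_rng(0)
subjects = rng.normal(0.5, 0.05, size=(8, 10, 2))
prediction = subjects.mean(axis=0)
print(density_score(prediction, subjects))
```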
👥 Authors
Xiaochuan Liu
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Xin Cheng
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Yuchong Sun
Renmin University of China (Vision-Language)
Xiaoxue Wu
Fudan University (video generation)
Ruihua Song
Renmin University of China (AI-based creation, multi-modality, chitchat, natural language understanding, information retrieval, information extraction)
Hao Sun
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
Denghao Zhang
Department of Psychology, Renmin University of China, Beijing, China