AI Summary
This work addresses the lack of data-driven modeling of the coordination between natural gaze and head motion in existing methods, which often results in unnatural or gaze-disconnected head movements. To overcome this limitation, the paper introduces a novel approach that automatically extracts gaze-head paired data from large-scale in-the-wild videos and employs a conditional variational autoencoder (cVAE) to explicitly model their probabilistic relationship and temporal dynamics. The framework integrates appearance-based gaze estimation, an automated data pipeline, and generative video synthesis to enable gaze-conditioned generation of diverse and realistic head motions. Experimental results demonstrate that the generated head movements significantly outperform current baselines, with human evaluations confirming higher perceived naturalness and user preference.
Abstract
We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder that produces plausible yet diverse gaze-conditioned head motion. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated with the input gaze, an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing a statistically significant preference for our approach over baseline methods.
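For readers unfamiliar with conditional VAEs, the generic training objective can be written as follows. This is the standard textbook cVAE bound, not necessarily the exact loss used in the paper. The symbols are assumptions introduced here for illustration: $h$ is a head motion sequence, $g$ the gaze sequence it is conditioned on, $z$ a latent variable, $q_\phi$ the encoder, and $p_\theta$ the decoder.

```latex
% Standard conditional VAE evidence lower bound (to be maximized).
% h: head motion sequence, g: conditioning gaze sequence, z: latent variable.
% All symbols are illustrative assumptions, not notation from the paper.
\log p_\theta(h \mid g) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid h, g)}\!\left[ \log p_\theta(h \mid z, g) \right]
  \;-\; \mathrm{KL}\!\left( q_\phi(z \mid h, g) \,\middle\|\, p(z \mid g) \right)
```

Under this formulation, generation would draw $z \sim p(z \mid g)$ (often simplified to a standard normal prior) and decode it together with $g$. Different samples of $z$ then give different head motions for the same gaze input, which is one way to obtain the "plausible yet diverse" behavior the abstract describes.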