🤖 AI Summary
Speech-driven 3D eye motion modeling is hindered by the weak correlation between audio and eye movement and by the scarcity of high-quality paired data. To address this, we introduce the first audio–eye movement synchronized dataset, comprising about 14 hours of high-fidelity data. We propose a dual-latent-space speech-to-motion framework grounded in physiological principles: it decouples head motion from eye motion and integrates lightweight eye gaze fitting, self-supervised audio–motion alignment, and a joint generative network, enabling end-to-end, real-time synthesis of 3D gaze, blinking, and facial and head motion. Quantitatively, our approach achieves state-of-the-art results on FID, MCD, and other metrics; subjectively, it attains a MOS of 4.21. It significantly improves motion diversity and naturalness over prior work, delivering the first deployable speech-driven solution for virtual-human gaze animation.
📝 Abstract
Although significant progress has been made recently in speech-driven 3D facial animation, the speech-driven animation of an indispensable facial component, eye gaze, has been overlooked. This is primarily due to the weak correlation between speech and eye gaze, as well as the scarcity of audio-gaze data, which makes it very challenging to generate 3D eye gaze motion from speech alone. In this paper, we propose a novel data-driven method that generates diverse 3D eye gaze motions in harmony with speech. To achieve this, we first construct an audio-gaze dataset that contains about 14 hours of audio-mesh sequences simultaneously featuring high-quality eye gaze motion, head motion, and facial motion. The motion data is acquired by performing lightweight eye gaze fitting and face reconstruction on videos from existing audio-visual datasets. We then tailor a novel speech-to-motion translation framework in which head motion and eye gaze motion are jointly generated from speech but are modeled in two separate latent spaces. This design stems from the physiological knowledge that the rotation range of the eyeballs is smaller than that of the head. By mapping the speech embedding into these two latent spaces, the difficulty of modeling the weak correlation between speech and non-verbal motion is attenuated. Finally, our TalkingEyes, integrated with a speech-driven 3D facial motion generator, can synthesize eye gaze motion, eye blinks, head motion, and facial motion collectively from speech. Extensive quantitative and qualitative evaluations demonstrate the superiority of the proposed method in generating diverse and natural 3D eye gaze motions from speech. The project page of this paper is: https://lkjkjoiuiu.github.io/TalkingEyes_Home/
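To make the dual-latent-space idea concrete, here is a minimal PyTorch sketch of one way such a framework could look. It assumes per-frame speech embeddings from a self-supervised audio encoder (e.g. Wav2Vec2); all module names, dimensions, output parameterizations, and the GRU decoders are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of a dual-latent-space speech-to-motion mapping.
# Assumption: speech_emb comes from a frozen self-supervised audio
# encoder (e.g. Wav2Vec2) resampled to the motion frame rate.
import torch
import torch.nn as nn

class DualLatentSpeechToMotion(nn.Module):
    def __init__(self, audio_dim=768, head_latent_dim=128, eye_latent_dim=64):
        super().__init__()
        # The shared speech embedding is mapped into two separate latent
        # spaces: a larger one for head motion and a smaller one for eye
        # gaze, reflecting the smaller rotation range of the eyeballs.
        self.to_head_latent = nn.Sequential(
            nn.Linear(audio_dim, head_latent_dim), nn.Tanh())
        self.to_eye_latent = nn.Sequential(
            nn.Linear(audio_dim, eye_latent_dim), nn.Tanh())
        # Separate temporal decoders regress per-frame head rotation
        # (3 Euler angles) and eye motion (gaze pitch, gaze yaw, blink).
        self.head_decoder = nn.GRU(head_latent_dim, 64, batch_first=True)
        self.head_out = nn.Linear(64, 3)
        self.eye_decoder = nn.GRU(eye_latent_dim, 32, batch_first=True)
        self.eye_out = nn.Linear(32, 3)

    def forward(self, speech_emb):
        # speech_emb: (batch, frames, audio_dim)
        z_head = self.to_head_latent(speech_emb)
        z_eye = self.to_eye_latent(speech_emb)
        h, _ = self.head_decoder(z_head)
        e, _ = self.eye_decoder(z_eye)
        return self.head_out(h), self.eye_out(e)

model = DualLatentSpeechToMotion()
dummy_speech = torch.randn(2, 100, 768)  # 2 clips, 100 frames each
head_rot, eye_motion = model(dummy_speech)
print(head_rot.shape, eye_motion.shape)  # (2, 100, 3) (2, 100, 3)
```

Giving the eye branch a lower-dimensional latent space is one simple way to encode the physiological prior that eyeball rotation spans a narrower range than head rotation, while still letting both branches be driven jointly from the same speech embedding.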