🤖 AI Summary
This work addresses the limitations of existing speech-driven 3D facial animation methods, which rely on registered template meshes, struggle to generalize to raw 3D scans with arbitrary topology, and lack controllable modeling of emotional dynamics. To overcome these challenges, we propose FreeTalk, a two-stage framework that operates without template registration and supports arbitrary mesh topologies. In the first stage, an Audio-To-Sparse (ATS) module predicts sparse keypoint displacements conditioned on both speech and emotion. In the second stage, a Sparse-To-Mesh (STM) module leverages intrinsic surface features to perform unsupervised mesh deformation. FreeTalk is the first method to enable emotion-controllable, topology-agnostic 3D talking-head generation, matching specialized baselines in-domain while significantly improving generalization and robustness on unseen identities and arbitrary-topology meshes.
📝 Abstract
Speech-driven 3D facial animation has advanced rapidly, yet most approaches remain tied to registered template meshes, preventing effective deployment on raw 3D scans with arbitrary topology. At the same time, modeling controllable emotional dynamics beyond lip articulation remains challenging, and existing emotion models are typically entangled with template-based parameterizations. We address these challenges by proposing FreeTalk, a two-stage framework for emotion-conditioned 3D talking-head animation that generalizes to unregistered face meshes with arbitrary vertex count and connectivity. First, an Audio-To-Sparse (ATS) module predicts a temporally coherent sequence of 3D landmark displacements from speech audio, conditioned on an emotion category and intensity. This sparse representation captures both articulatory and affective motion while remaining independent of mesh topology. Second, a Sparse-To-Mesh (STM) module transfers the predicted landmark motion to a target mesh by combining intrinsic surface features with landmark-to-vertex conditioning, producing dense per-vertex deformations without template fitting or correspondence supervision at test time. Extensive experiments show that FreeTalk matches specialized baselines when trained in-domain, while providing substantially improved robustness to unseen identities and mesh topologies. Code and pre-trained models will be made publicly available.
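To make the two-stage decomposition concrete, the sketch below shows how an ATS-style stage (audio + emotion → sparse landmark displacements) could feed an STM-style stage (landmark displacements → dense per-vertex offsets on a mesh of arbitrary vertex count). All class names, signatures, and the placeholder logic inside them are illustrative assumptions, not the authors' released implementation; the real modules are learned networks.

```python
# Illustrative sketch only: module names, shapes, and the placeholder
# arithmetic below are assumptions, not FreeTalk's actual API.
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class AudioToSparse:
    """Stage 1 (ATS, sketch): maps per-frame speech features plus an
    emotion label and intensity to displacements of K sparse 3D landmarks.
    A real ATS would be a learned sequence model; here we emit a dummy
    displacement scaled by the emotion intensity."""
    num_landmarks: int = 68  # assumed landmark count

    def predict(self, audio_frames: List[List[float]],
                emotion: str, intensity: float) -> List[List[Vec3]]:
        return [[(0.0, 0.0, intensity * 0.001)] * self.num_landmarks
                for _ in audio_frames]


@dataclass
class SparseToMesh:
    """Stage 2 (STM, sketch): transfers landmark motion to dense
    per-vertex offsets on a mesh with arbitrary vertex count.
    Placeholder logic: apply the mean landmark displacement everywhere;
    the real STM conditions on intrinsic surface features."""

    def deform(self, vertices: List[Vec3],
               landmark_disp: List[Vec3]) -> List[Vec3]:
        n = len(landmark_disp)
        mean = tuple(sum(d[i] for d in landmark_disp) / n for i in range(3))
        return [(v[0] + mean[0], v[1] + mean[1], v[2] + mean[2])
                for v in vertices]


def animate(audio_frames: List[List[float]], emotion: str,
            intensity: float, vertices: List[Vec3]) -> List[List[Vec3]]:
    """Run both stages: one deformed copy of the mesh per audio frame.
    Works for any vertex count, since STM never assumes a template."""
    ats, stm = AudioToSparse(), SparseToMesh()
    landmark_seq = ats.predict(audio_frames, emotion, intensity)
    return [stm.deform(vertices, disp) for disp in landmark_seq]
```

The point of the split is visible in `animate`: only the ATS stage sees audio and emotion, and only the STM stage sees mesh geometry, so the sparse landmark sequence is the sole interface between them and the mesh topology never constrains the audio model.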