🤖 AI Summary
Existing speech-driven 3D talking head methods are constrained by fixed mesh topologies, which prevents them from generalizing to arbitrary topologies such as real-world scanned faces. To address this, we propose a topology-agnostic speech-driven animation framework. Our method introduces: (i) a registration-free training paradigm eliminating reliance on point-to-point correspondences; (ii) a heat-diffusion-based feature prediction mechanism ensuring topology-robust geometric modeling across diverse meshes; (iii) an adaptive graph neural network that learns dynamic graph structures per input; and (iv) a multi-granularity lip-sync evaluation metric suite addressing shortcomings of conventional metrics in temporal alignment and semantic consistency. Experiments demonstrate high-fidelity animation on arbitrary-topology 3D faces, including unseen scanned data, with results that compare favorably with fixed-topology baselines. This sets a new benchmark for 3D talking head generation without the topology constraint.
📝 Abstract
Generating speech-driven 3D talking heads presents numerous challenges, among them handling varying mesh topologies, where no point-wise correspondence exists across all the meshes the model can animate. Existing methods avoid this difficulty by assuming a fixed topology. While this assumption simplifies the problem, it limits applicability, as unseen meshes must adhere to the training topology. This work presents a framework capable of animating 3D faces in arbitrary topologies, including real scanned data. Our approach relies on a model that uses heat diffusion to predict features robust to the mesh topology. We explore two training settings: a registered one, in which the meshes in a training sequence share a fixed topology but any mesh can be animated at test time, and a fully unregistered one, which allows effective training with varying mesh structures. Additionally, we highlight the limitations of current evaluation metrics and propose new metrics that better evaluate lip-syncing between speech and facial movements. Our extensive evaluation shows that our approach performs favorably compared to fixed-topology techniques, setting a new benchmark by offering a versatile and high-fidelity solution for 3D talking head generation where the topology constraint is dropped.
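
To give a rough idea of what diffusing features over a mesh involves, the following is a minimal sketch and not the model described above. The abstract does not specify the diffusion operator, so everything here is an assumption made for illustration: the combinatorial graph Laplacian (a geometry-aware Laplacian would be a more realistic choice), the single implicit Euler step, and the helper names `graph_laplacian` and `heat_diffuse`. The point is only that the operator is built from each mesh's own connectivity, so nothing in the code depends on a fixed vertex count or ordering.

```python
import numpy as np

def graph_laplacian(num_vertices, faces):
    """Combinatorial graph Laplacian L = D - A built from triangle faces.
    (Illustrative choice; it ignores edge lengths and angles.)"""
    A = np.zeros((num_vertices, num_vertices))
    for i, j, k in faces:
        for a, b in ((i, j), (j, k), (k, i)):
            A[a, b] = A[b, a] = 1.0
    return np.diag(A.sum(axis=1)) - A

def heat_diffuse(features, laplacian, t):
    """One implicit Euler step of du/dt = -L u: solve (I + t L) u_t = u_0."""
    n = laplacian.shape[0]
    return np.linalg.solve(np.eye(n) + t * laplacian, features)

# Two triangulations of the same four vertices (the square is split
# along a different diagonal), standing in for two mesh topologies.
faces_a = [(0, 1, 2), (0, 2, 3)]
faces_b = [(0, 1, 3), (1, 2, 3)]

# Per-vertex feature: one unit of "heat" placed on vertex 0.
u0 = np.array([[1.0], [0.0], [0.0], [0.0]])

u_a = heat_diffuse(u0, graph_laplacian(4, faces_a), t=0.5)
u_b = heat_diffuse(u0, graph_laplacian(4, faces_b), t=0.5)

# For very large t the heat spreads evenly over a connected mesh.
u_long = heat_diffuse(u0, graph_laplacian(4, faces_a), t=1e6)
```

In this toy setting the total heat is conserved for any `t`, and a small `t` keeps features local while a large `t` averages them over the whole mesh. A learned model could treat the diffusion time as a trainable parameter, but whether the framework above does so is not stated in the abstract.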