🤖 AI Summary
This work proposes the first one-shot talking head synthesis framework based on 3D Gaussian splatting, eliminating the need for identity-specific training while achieving high-quality generation. Existing methods often rely on personalized models and struggle to generalize to unseen identities. To address this, the proposed approach incorporates a structured facial prior into reconstruction and predicts Gaussian parameters separately for visible and occluded regions, so that a complete head can be reconstructed from a single image. It then introduces a dual-branch motion field that separately models coarse- and fine-grained facial dynamics, improving detail fidelity and lip synchronization. Experimental results show that the method outperforms current state-of-the-art techniques in visual quality, lip-sync accuracy, and inference efficiency, and that it generalizes across identities.
📝 Abstract
High-quality, real-time talking head synthesis remains a fundamental challenge in computer vision. Existing reconstruction- and rendering-based methods typically rely on identity-specific models, limiting cross-identity generalization. To address this issue, we propose SDTalk, a one-shot 3D Gaussian Splatting (3DGS)-based framework that generalizes to unseen identities without personalized training or fine-tuning. Our framework comprises two modules with a two-stage training strategy. In the first stage, we incorporate structured facial priors into the reconstruction module and separately predict 3DGS parameters for visible and occluded regions, enabling complete head reconstruction from a single image. In the second stage, we introduce a dual-branch motion field to model coarse and fine facial dynamics, improving detail fidelity and lip synchronization. Experiments demonstrate that SDTalk surpasses existing methods in both visual quality and inference efficiency.
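The abstract does not specify how the dual-branch motion field is built, so the following is only a minimal sketch of the general idea of summing a coarse and a fine displacement on Gaussian centers. Every name and number here (`coarse_branch`, `fine_branch`, `deform`, `head_shift`, `mouth_open`, `mouth_y`, `radius`) is a hypothetical stand-in chosen for illustration, not part of SDTalk.

```python
# Illustrative sketch only: NOT SDTalk's implementation. It assumes the two
# branches each output a 3D offset per Gaussian center and that the offsets add.
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def add(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise sum of two 3D vectors."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def coarse_branch(center: Vec3, head_shift: Vec3) -> Vec3:
    """Coarse motion (assumed): one shared offset applied to every Gaussian."""
    return head_shift


def fine_branch(center: Vec3, mouth_open: float,
                mouth_y: float = -0.5, radius: float = 0.3) -> Vec3:
    """Fine motion (assumed): a local downward offset that fades linearly
    with vertical distance from a hypothetical mouth height `mouth_y`."""
    weight = max(0.0, 1.0 - abs(center[1] - mouth_y) / radius)
    return (0.0, -mouth_open * weight, 0.0)


def deform(centers: List[Vec3], head_shift: Vec3, mouth_open: float) -> List[Vec3]:
    """Move each Gaussian center by the sum of the coarse and fine offsets."""
    return [
        add(c, add(coarse_branch(c, head_shift), fine_branch(c, mouth_open)))
        for c in centers
    ]


# Two toy Gaussian centers: one near the forehead, one at the mouth.
centers = [(0.0, 0.5, 0.0), (0.0, -0.5, 0.0)]
moved = deform(centers, head_shift=(0.1, 0.0, 0.0), mouth_open=0.2)
```

In this toy setup the forehead point receives only the shared coarse shift, while the mouth point additionally moves downward, which is the kind of separation between global and local dynamics that a two-branch design is meant to capture.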