3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation

📅 2025-09-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech-driven 3D facial animation methods struggle to simultaneously achieve precise control, natural head motion, and efficient editing, and they fail to model the diversity of plausible lip and head movements for identical speech input. This paper proposes 3DiFACE, a fully-convolutional diffusion-based framework that combines a sparsely-guided motion diffusion mechanism with viseme-level conditioning to explicitly model viseme-level motion diversity. The approach supports personalized speaking-style generation and localized re-synthesis, and further enables keyframe specification and interpolation-based editing for fine-grained animation control. Quantitative and qualitative evaluations show that the method outperforms state-of-the-art approaches in animation naturalness, motion diversity, and editing flexibility, unifying high-fidelity synthesis with intuitive, granular controllability for editable 3D facial animation.
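
As a rough illustration of how sparse keyframe guidance can be imposed on a motion diffusion sampler, the sketch below uses inpainting-style replacement at each denoising step. All names (`denoiser`, `betas`, `keyframe_mask`, etc.) are assumptions for illustration; this is not 3DiFACE's code, and the paper's sparsely-guided mechanism may differ in detail.

```python
# Illustrative sketch (not the paper's implementation): inpainting-style keyframe
# guidance during DDPM-style diffusion sampling of a facial-motion sequence.
import torch

@torch.no_grad()
def sample_with_keyframes(denoiser, audio_feats, keyframes, keyframe_mask,
                          betas, num_steps):
    """Generate a motion sequence [T, D] anchored at sparse keyframes.

    keyframes:     [T, D] tensor, meaningful only where keyframe_mask == 1
    keyframe_mask: [T, 1] binary tensor marking frames that must stay fixed
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(keyframes)  # start from pure noise
    for t in reversed(range(num_steps)):
        # Replace keyframe frames with a noised version of the known keyframes,
        # so the model inpaints the remaining frames consistently around them.
        noise = torch.randn_like(keyframes)
        x_known = torch.sqrt(alpha_bars[t]) * keyframes + \
                  torch.sqrt(1.0 - alpha_bars[t]) * noise
        x = keyframe_mask * x_known + (1.0 - keyframe_mask) * x

        # One reverse-diffusion step: predict and remove noise (DDPM posterior mean).
        eps = denoiser(x, t, audio_feats)
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / \
               torch.sqrt(alphas[t])
        x = mean + (torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else 0.0)

    # Hard-set keyframes at the end so edits exactly match the user constraints.
    return keyframe_mask * keyframes + (1.0 - keyframe_mask) * x
```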

📝 Abstract
Creating personalized 3D animations with precise control and realistic head motions remains challenging for current speech-driven 3D facial animation methods. Editing these animations is especially complex and time-consuming; it requires precise control and is typically handled by highly skilled animators. Most existing works focus on controlling the style or emotion of the synthesized animation and cannot edit/regenerate parts of an input animation. They also overlook the fact that multiple plausible lip and head movements can match the same audio input. To address these challenges, we present 3DiFACE, a novel method for holistic speech-driven 3D facial animation. Our approach produces diverse plausible lip and head motions for a single audio input and allows for editing via keyframing and interpolation. Specifically, we propose a fully-convolutional diffusion model that can leverage the viseme-level diversity in our training corpus. Additionally, we employ a speaking-style personalization and a novel sparsely-guided motion diffusion to enable precise control and editing. Through quantitative and qualitative evaluations, we demonstrate that our method is capable of generating and editing diverse holistic 3D facial animations given a single audio input, with control between high fidelity and diversity. Code and models are available here: https://balamuruganthambiraja.github.io/3DiFACE
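
To make the keyframing-and-editing workflow concrete, here is a minimal, hypothetical setup for localized re-synthesis: an existing animation is kept fixed except for one segment that is regenerated. The tensor shapes and the reference to `sample_with_keyframes` (from the sketch above) are assumptions for illustration, not the paper's API.

```python
# Illustrative editing setup (assumed shapes and names, not the paper's API):
# regenerate only frames 40..80 of an existing animation, keeping the rest fixed.
import torch

T, D = 120, 70                       # sequence length and per-frame motion dims (example values)
existing_motion = torch.zeros(T, D)  # previously synthesized or captured animation
keyframe_mask = torch.ones(T, 1)     # 1 = keep frame fixed, 0 = re-synthesize
keyframe_mask[40:80] = 0.0           # segment marked for editing / regeneration

# Passing existing_motion and keyframe_mask to a sparsely-guided sampler such as
# `sample_with_keyframes` above would fill frames 40..80 so they blend smoothly
# with the fixed surrounding context.
```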
Problem

Research questions and friction points this paper is trying to address.

Creating personalized 3D animations with precise control
Editing animations is complex and requires skilled animators
Generating diverse plausible lip and head motions from audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully-convolutional diffusion model for viseme-level diversity (see the architecture sketch after this list)
Speaking-style personalization for precise control
Sparsely-guided motion diffusion to enable editing
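
A fully-convolutional temporal denoiser of the kind listed above could, in rough outline, look like the following sketch. Layer sizes, the audio-feature dimensionality, and the dilated-convolution stack are assumptions for illustration, not the paper's actual architecture.

```python
# Illustrative sketch of a fully-convolutional temporal denoiser (an assumption
# about the general architecture class; 3DiFACE's exact layers are not reproduced here).
import torch
import torch.nn as nn

class ConvDenoiser(nn.Module):
    def __init__(self, motion_dim=70, audio_dim=768, hidden=256, layers=4):
        super().__init__()
        self.in_proj = nn.Conv1d(motion_dim + audio_dim + 1, hidden, kernel_size=1)
        # Stacked dilated 1D convolutions keep the model fully convolutional in time,
        # so it can train on short clips and run on sequences of arbitrary length.
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=2**i, dilation=2**i),
                nn.SiLU(),
            )
            for i in range(layers)
        ])
        self.out_proj = nn.Conv1d(hidden, motion_dim, kernel_size=1)

    def forward(self, x_t, t, audio_feats):
        """x_t: [B, T, motion_dim], t: [B], audio_feats: [B, T, audio_dim]."""
        # Broadcast the diffusion timestep over all frames as an extra channel.
        t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        h = torch.cat([x_t, audio_feats, t_feat], dim=-1).transpose(1, 2)  # [B, C, T]
        h = self.in_proj(h)
        for block in self.blocks:
            h = h + block(h)  # residual connections stabilize training
        return self.out_proj(h).transpose(1, 2)  # predicted noise, [B, T, motion_dim]
```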