DiffMagicFace: Identity Consistent Facial Editing of Real Videos

📅 2026-04-15
📈 Citations: 0
Influential: 0
📄 PDF

career value

223K/year
🤖 AI Summary
Existing text-guided facial video editing methods struggle to simultaneously preserve identity consistency and cross-frame semantic coherence. This work proposes a dual-control diffusion framework that requires no real video training, achieving high-fidelity and temporally consistent facial editing during inference through synergistic text and image guidance. By leveraging 3D face reconstruction and optimization, the method constructs a multi-view synthetic dataset, enabling—for the first time—highly consistent editing results without any video-based training. On challenging tasks such as talking-head generation, the approach surpasses state-of-the-art methods in both visual quality and quantitative metrics, delivering results comparable to those produced by professional rendering software.

Technology Category

Application Category

📝 Abstract
Text-conditioned image editing has greatly benefitted from the advancements in Image Diffusion Models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a unique video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. The creation of a data set is achieved through rendering techniques and the subsequent application of optimization algorithms. Remarkably, our approach does not depend on video datasets but still delivers high-quality results in both consistency and content. The excellent effect holds even for complex tasks like talking head videos and distinguishing closely related categories. The videos edited using our framework exhibit parity with videos that are made using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.
Problem

Research questions and friction points this paper is trying to address.

identity consistency
facial video editing
text-conditioned editing
temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

identity consistency
text-conditioned video editing
diffusion models
multi-view dataset generation
frame-level coherence
🔎 Similar Papers