Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Achieving both high fidelity and temporal consistency is the central difficulty in cinematic-scale, long-video face swapping. To address it, the paper proposes LivingSwap, which the authors describe as the first video-reference-guided face-swapping framework. It uses keyframes as conditioning signals to inject the target identity, which gives precise and controllable identity editing; it uses video reference guidance to carry over source-video attributes (expression, illumination, and motion); and it performs temporal stitching to keep frames consistent with one another. The authors also introduce Face2Face, a newly constructed paired face-swapping dataset, and reverse the data pairs to strengthen supervised training. According to the reported experiments, LivingSwap outperforms state-of-the-art methods in complex scenarios: it reaches cinematic-quality reconstruction, keeps the target identity stable, and substantially reduces manual intervention.

📝 Abstract
Video face swapping is crucial in film and entertainment production, where achieving high fidelity and temporal consistency over long and complex video sequences remains a significant challenge. Inspired by recent advances in reference-guided image editing, we explore whether rich visual attributes from source videos can be similarly leveraged to enhance both fidelity and temporal coherence in video face swapping. Building on this insight, this work presents LivingSwap, the first video-reference-guided face-swapping model. Our approach employs keyframes as conditioning signals to inject the target identity, enabling flexible and controllable editing. By combining keyframe conditioning with video reference guidance, the model performs temporal stitching to ensure stable identity preservation and high-fidelity reconstruction across long video sequences. To address the scarcity of data for reference-guided training, we construct a paired face-swapping dataset, Face2Face, and further reverse the data pairs to ensure reliable ground-truth supervision. Extensive experiments demonstrate that our method achieves state-of-the-art results, seamlessly integrating the target identity with the source video's expressions, lighting, and motion, while significantly reducing manual effort in production workflows. Project webpage: https://aim-uofa.github.io/LivingSwap
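
The abstract does not spell out how temporal stitching is implemented. One plausible reading, offered here purely as an illustrative assumption, is that a long video is cut into segments bounded by keyframes, with neighbouring segments sharing their boundary keyframe so that the identity stays anchored at every seam. The function name `keyframe_segments` and the fixed `stride` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def keyframe_segments(num_frames: int, stride: int) -> List[Tuple[int, int]]:
    """Split frame indices [0, num_frames) into keyframe-bounded segments.

    Each segment is an inclusive (start, end) pair. Both ends are treated as
    keyframes, and consecutive segments share their boundary frame.
    This is an assumed scheme for illustration, not the paper's algorithm.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    if stride < 1:
        raise ValueError("stride must be positive")

    last = num_frames - 1
    if last == 0:
        # A single-frame video is one degenerate segment.
        return [(0, 0)]

    segments: List[Tuple[int, int]] = []
    start = 0
    while start < last:
        end = min(start + stride, last)  # clamp the final segment
        segments.append((start, end))
        start = end  # the next segment starts on the shared keyframe
    return segments


print(keyframe_segments(10, 4))  # [(0, 4), (4, 8), (8, 9)]
```

Under this reading, each segment could be edited independently while the shared keyframes tie the results together; how LivingSwap actually conditions on keyframes is described only at the level of the abstract above.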

Problem

Research questions and friction points this paper is trying to address.

Improving fidelity and temporal coherence in video face swapping
Keeping the target identity stable across long video sequences
Reducing manual effort in film and entertainment production workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

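Keyframe conditioning to inject the target identity
Video reference guidance that carries over the source video's expressions, lighting, and motion, combined with temporal stitching across long sequences
Paired dataset (Face2Face) with reversed data pairs for reliable ground-truth supervision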
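
The page says only that the Face2Face data pairs are reversed to obtain reliable ground truth. One way to read this, stated here as an assumption rather than the authors' confirmed procedure, is that each pair couples a real video with a synthetically swapped version of it, and that reversal makes the synthetic video the model input and the real footage the supervision target. The class and field names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapPair:
    """A paired sample: real footage and a synthetic swap made from it (assumed layout)."""
    real_video: str
    synthetic_video: str


@dataclass(frozen=True)
class TrainingExample:
    """What the model sees (model_input) and what it is supervised against (ground_truth)."""
    model_input: str
    ground_truth: str


def reversed_example(pair: SwapPair) -> TrainingExample:
    # Reversal: feed the synthetic video in and supervise with the real one,
    # so the training target is never an artifact-prone generated video.
    return TrainingExample(model_input=pair.synthetic_video,
                           ground_truth=pair.real_video)


example = reversed_example(SwapPair("real.mp4", "swapped.mp4"))
print(example.ground_truth)  # real.mp4
```

The point of the sketch is the direction of supervision: with the pair reversed, the loss is computed against genuine footage rather than against the output of another face-swapping method.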

Authors

Zekai Luo
Zhejiang University
Zongze Du
Zhejiang University
Zhouhang Zhu
Zhejiang University
Hao Zhong
Professor, Shanghai Jiao Tong University
Software Engineering
Muzhi Zhu
Zhejiang University
Computer Vision, Machine Learning
Wen Wang
Zhejiang University
Yuling Xi
Zhejiang University
Chenchen Jing
Zhejiang University of Technology
Hao Chen
Zhejiang University
Chunhua Shen
Zhejiang University
Computer Vision, Machine Learning