CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video-based face swapping faces a fundamental trade-off between identity transfer and preservation of dynamic attributes such as pose, expression, and lip motion. Existing approaches prioritize identity fidelity but often compromise temporal consistency and motion detail. To address this, the paper proposes CanonSwap, which decouples appearance from motion by performing identity editing in a unified canonical space. A Partial Identity Modulation module adaptively fuses source identity features under a spatial mask, enabling high-fidelity identity editing while preserving the target video's dynamic structure. The paper also introduces fine-grained synchronization metrics for comprehensive evaluation. Extensive experiments show the approach significantly outperforms state-of-the-art methods in visual quality, temporal coherence, and identity fidelity, yielding results that are more realistic, temporally stable, and geometrically precise.

📝 Abstract
Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.
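The masked fusion described for the Partial Identity Modulation module can be illustrated with a minimal sketch: an AdaIN-style channel-wise modulation of a canonical-space feature map, blended back through a soft face mask so non-face regions keep the target's features. The shapes, function name, and modulation form below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def partial_identity_modulation(canon_feat, id_scale, id_shift, face_mask):
    """Sketch of spatially masked identity modulation (hypothetical shapes).

    canon_feat: (C, H, W) canonical-space feature map of the target face
    id_scale, id_shift: (C,) modulation parameters assumed to be predicted
        from the source identity embedding
    face_mask: (H, W) soft mask in [0, 1] restricting edits to face regions
    """
    # Channel-wise (AdaIN-style) modulation applied over the whole map ...
    modulated = canon_feat * id_scale[:, None, None] + id_shift[:, None, None]
    # ... then blended so only masked (face) regions are actually modified
    return face_mask[None] * modulated + (1.0 - face_mask[None]) * canon_feat

# Toy example: a constant feature map with a square "face" region.
feat = np.ones((4, 8, 8), dtype=np.float32)
scale = np.full(4, 2.0, dtype=np.float32)
shift = np.zeros(4, dtype=np.float32)
mask = np.zeros((8, 8), dtype=np.float32)
mask[2:6, 2:6] = 1.0
out = partial_identity_modulation(feat, scale, shift, mask)
```

Inside the mask the features are modulated toward the source identity; outside it they are untouched, which is what lets the target's background and dynamic structure pass through unchanged.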
Problem

Research questions and friction points this paper is trying to address.

Decoupling facial motion and appearance in video face swapping
Preserving target face's dynamic attributes during identity transfer
Achieving high-fidelity identity transfer with minimal artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples motion and appearance via canonical space
Uses Partial Identity Modulation for precise transfer
Introduces fine-grained synchronization metrics for evaluation
Xiangyang Luo
Tsinghua Shenzhen International Graduate School, Tsinghua University
Ye Zhu
International Digital Economy Academy (IDEA)
Yunfei Liu
International Digital Economy Academy (IDEA)
Lijian Lin
Tencent ARC Lab
Computer Vision, Visual Tracking, Video Object Detection
Cong Wan
Xi'an Jiaotong University
AIGC, 3D, diffusion
Zijian Cai
Xi’an Jiaotong University
Shao-Lun Huang
T-SIGS, Tsinghua University
Information Theory, Machine Learning
Yu Li
International Digital Economy Academy (IDEA)