🤖 AI Summary
Video face swapping faces a fundamental trade-off between transferring the source identity and preserving the target's dynamic attributes, such as pose, expression, and lip motion. Existing approaches prioritize identity fidelity but often compromise temporal consistency and motion detail. To address this, the authors propose mapping the target face into a unified canonical representation space that explicitly decouples appearance from motion, editing identity there, and then reintegrating the result into the original video space. Their Partial Identity Modulation module adaptively fuses source identity features under a spatial mask, restricting edits to facial regions so that the target video's dynamic structure is fully preserved. They further design fine-grained synchronization metrics for comprehensive evaluation. Extensive experiments demonstrate that the approach significantly outperforms state-of-the-art methods across three key dimensions: visual quality, temporal coherence, and identity fidelity, yielding results that are more realistic, temporally stable, and geometrically precise.
📝 Abstract
Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head pose, facial expressions, and lip sync. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face's dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at https://luoxyhappy.github.io/CanonSwap/.
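The core idea of the Partial Identity Modulation module, as described above, is that identity-modulated features replace target features only inside a facial region given by a spatial mask, leaving everything outside untouched. A minimal sketch of that masked blending follows; the tensor shapes, the function name, and the AdaIN-style scale-only "modulation" are illustrative assumptions, not the paper's actual learned module.

```python
import numpy as np

def partial_identity_modulation(target_feat, identity_feat, mask):
    """Blend identity-modulated features into a target feature map.

    target_feat:   (C, H, W) canonical-space features of the target face
    identity_feat: (C,)      source identity embedding
    mask:          (1, H, W) soft spatial mask in [0, 1], 1 = facial region

    Hypothetical stand-in for the paper's Partial Identity Modulation.
    """
    # Toy "modulation": scale the target features channel-wise by the
    # identity embedding (an AdaIN-style placeholder for the real module).
    scale = identity_feat.reshape(-1, 1, 1)
    modulated = target_feat * (1.0 + scale)
    # Restrict the edit to the masked facial region; outside the mask the
    # original target features pass through unchanged.
    return mask * modulated + (1.0 - mask) * target_feat

# Tiny demo: a uniform feature map, an all-ones identity embedding, and a
# square mask covering the "face".
target = np.ones((4, 8, 8))
identity = np.ones(4)
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0
out = partial_identity_modulation(target, identity, mask)
# Inside the mask features are doubled; outside they are untouched.
```

The same blending pattern applies whether the modulation is a simple scale, as here, or a learned transformation; the mask is what confines identity edits to facial regions while the rest of the frame keeps the target's appearance.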