🤖 AI Summary
This work addresses temporal inconsistency and identity distortion in diffusion-based video face swapping by proposing a plug-and-play, training-free approach. The method preserves source identity features through spectral attention interpolation, achieves precise facial alignment via target-structure-guided attention injection, and enhances inter-frame coherence with an optical flow-guided temporal smoothing mechanism. Notably, this is the first approach to seamlessly integrate with image-level diffusion-based face swapping models without requiring fine-tuning or additional training, significantly improving both temporal consistency and visual fidelity in video face swapping. The proposed solution is modular, practical, and readily deployable within existing pipelines.
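As a rough illustration of the spectral-interpolation idea above (a sketch, not the paper's actual implementation), one can blend the low-frequency band of a source attention feature map into the generated one via a 2-D FFT. The function name and the `alpha` and `cutoff` parameters below are illustrative assumptions:

```python
import torch


def spectral_interpolate(src_feat, gen_feat, alpha=0.5, cutoff=0.25):
    """Blend two (B, C, H, W) feature maps in the frequency domain.

    Hypothetical sketch: the low-frequency band (coarse identity/appearance)
    is interpolated toward the source, while high frequencies (fine detail)
    are kept from the generation branch. `alpha` and `cutoff` are
    illustrative, not values from the paper.
    """
    # 2-D FFT over spatial dims, shifted so DC sits at the center
    src_f = torch.fft.fftshift(torch.fft.fft2(src_feat), dim=(-2, -1))
    gen_f = torch.fft.fftshift(torch.fft.fft2(gen_feat), dim=(-2, -1))

    # radial low-pass mask over normalized frequency coordinates
    h, w = src_feat.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(src_f.dtype)

    # interpolate only the low-frequency band toward the source spectrum
    blended = gen_f + low_pass * alpha * (src_f - gen_f)
    out = torch.fft.ifft2(torch.fft.ifftshift(blended, dim=(-2, -1)))
    return out.real
```

With `alpha=0` the generated features pass through unchanged, and with `alpha=1` and a cutoff covering the whole spectrum the source features are recovered, so the parameter smoothly trades detail against identity preservation.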
📝 Abstract
We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique that facilitates generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection, better aligning the generation with the structural features of the target frame. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that reduces the temporal inconsistencies typically encountered in frame-wise generation, enforcing spatiotemporal coherence without modifying the underlying diffusion model. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that it significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
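To make the flow-guided smoothing idea concrete, the minimal sketch below (an assumption-laden illustration, not the paper's code) warps the previous frame's features along a forward optical-flow field and blends them into the current frame's features to damp flicker. The function name, the `(B, 2, H, W)` pixel-unit flow convention, and the scalar `blend` rule are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F


def flow_guided_smooth(prev_feat, curr_feat, flow, blend=0.5):
    """Warp prev_feat (B, C, H, W) along `flow` and blend with curr_feat.

    Hypothetical sketch: `flow` is a (B, 2, H, W) forward flow field in
    pixel units (x-displacement first); the blend rule is illustrative.
    """
    b, c, h, w = prev_feat.shape
    # base sampling grid in pixel coordinates
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=prev_feat.dtype),
        torch.arange(w, dtype=prev_feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xx, yy), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)

    # normalize to [-1, 1] as grid_sample expects (x, y) order
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)

    warped = F.grid_sample(prev_feat, norm_grid, align_corners=True)
    return (1.0 - blend) * curr_feat + blend * warped
```

With zero flow and `blend=1.0` this reduces to copying the previous frame's features, while `blend=0.0` leaves the current frame untouched; intermediate values trade responsiveness against temporal stability.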