🤖 AI Summary
This work addresses temporal inconsistency and identity distortion in diffusion-based video face swapping by proposing a plug-and-play, training-free approach. The method preserves source identity features through spectral attention interpolation, achieves precise facial alignment via target-structure-guided attention injection, and enhances inter-frame coherence with an optical flow-guided temporal smoothing mechanism. Notably, this is the first approach to seamlessly integrate with image-level diffusion-based face swapping models without requiring fine-tuning or additional training, significantly improving both temporal consistency and visual fidelity in video face swapping. The proposed solution is modular, practical, and readily deployable within existing pipelines.
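As a rough illustration of the spectral-interpolation idea above (a sketch, not the paper's actual implementation), one can blend the low-frequency band of a source attention feature map into the generated one via a 2-D FFT. The function name and the `alpha` and `cutoff` parameters below are illustrative assumptions:

```python
import torch


def spectral_interpolate(src_feat, gen_feat, alpha=0.5, cutoff=0.25):
    """Blend two (B, C, H, W) feature maps in the frequency domain.

    Hypothetical sketch: the low-frequency band (coarse identity/appearance)
    is interpolated toward the source, while high frequencies (fine detail)
    are kept from the generation branch. `alpha` and `cutoff` are
    illustrative, not values from the paper.
    """
    # 2-D FFT over spatial dims, shifted so DC sits at the center
    src_f = torch.fft.fftshift(torch.fft.fft2(src_feat), dim=(-2, -1))
    gen_f = torch.fft.fftshift(torch.fft.fft2(gen_feat), dim=(-2, -1))

    # radial low-pass mask over normalized frequency coordinates
    h, w = src_feat.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    low_pass = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(src_f.dtype)

    # interpolate only the low-frequency band toward the source spectrum
    blended = gen_f + low_pass * alpha * (src_f - gen_f)
    out = torch.fft.ifft2(torch.fft.ifftshift(blended, dim=(-2, -1)))
    return out.real
```

With `alpha=0` the generated features pass through unchanged, and with `alpha=1` and a cutoff covering the whole spectrum the source features are recovered, so the parameter smoothly trades detail against identity preservation.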
📝 Abstract
We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique that facilitates generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection, better aligning the generation with the structural features of the target frame. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that reduces the temporal inconsistencies typically encountered in frame-wise generation, enforcing spatiotemporal coherence without modifying the underlying diffusion model. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that it significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
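To make the flow-guided smoothing idea concrete, the minimal sketch below (an assumption-laden illustration, not the paper's code) warps the previous frame's features along a forward optical-flow field and blends them into the current frame's features to damp flicker. The function name, the `(B, 2, H, W)` pixel-unit flow convention, and the scalar `blend` rule are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F


def flow_guided_smooth(prev_feat, curr_feat, flow, blend=0.5):
    """Warp prev_feat (B, C, H, W) along `flow` and blend with curr_feat.

    Hypothetical sketch: `flow` is a (B, 2, H, W) forward flow field in
    pixel units (x-displacement first); the blend rule is illustrative.
    """
    b, c, h, w = prev_feat.shape
    # base sampling grid in pixel coordinates
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=prev_feat.dtype),
        torch.arange(w, dtype=prev_feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xx, yy), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)

    # normalize to [-1, 1] as grid_sample expects (x, y) order
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)

    warped = F.grid_sample(prev_feat, norm_grid, align_corners=True)
    return (1.0 - blend) * curr_feat + blend * warped
```

With zero flow and `blend=1.0` this reduces to copying the previous frame's features, while `blend=0.0` leaves the current frame untouched; intermediate values trade responsiveness against temporal stability.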