DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly optimizing identity similarity, attribute fidelity—including expression, pose, and illumination—and temporal consistency in video face swapping. To this end, we propose DreamID-V, the first framework to introduce diffusion Transformers to this task, featuring a modality-aware conditional injection mechanism and an identity-anchored video synthesizer. We further design SyncID-Pipe, a data pipeline that generates bidirectional identity quadruplets, and employ a synthetic-to-real curriculum learning strategy combined with identity-consistency reinforcement learning to significantly enhance both identity fidelity and temporal stability. Additionally, we release IDBench-V, the first comprehensive benchmark for video face swapping evaluation. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches across multiple metrics and exhibits strong generalization across diverse face swapping scenarios.

📝 Abstract
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address this challenge, we propose a comprehensive framework that seamlessly transfers the strengths of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline, SyncID-Pipe, that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon this paired data, we propose the first Diffusion Transformer-based framework, DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-modal conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate that DreamID-V outperforms state-of-the-art methods and exhibits exceptional versatility: it can be seamlessly adapted to various swap-related tasks.
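The page does not describe the internals of the Modality-Aware Conditioning module, so the following is only a loose NumPy sketch of what "discriminatively injecting multi-modal conditions" could mean in a Transformer block: a global identity embedding modulates tokens via AdaLN-style scale/shift, while spatially aligned attribute conditions (pose, expression, lighting) are added per token. All shapes, weights, and the function itself are illustrative assumptions, not the paper's method.

```python
import numpy as np

def modality_aware_inject(tokens, id_emb, attr_emb, w_id, w_attr):
    """Toy sketch of modality-aware conditional injection (hypothetical
    simplification, NOT DreamID-V's actual module).
    - id_emb: global identity vector -> AdaLN-style scale/shift per channel
    - attr_emb: per-token attribute features -> additive injection
    tokens: (n, d), id_emb: (e,), attr_emb: (n, c),
    w_id: (e, 2*d), w_attr: (c, d) -- all shapes are assumptions."""
    scale_shift = id_emb @ w_id                 # (2*d,)
    d = tokens.shape[-1]
    scale, shift = scale_shift[:d], scale_shift[d:]
    # normalize tokens, then modulate globally by identity
    mu = tokens.mean(-1, keepdims=True)
    sd = tokens.std(-1, keepdims=True) + 1e-5
    out = (tokens - mu) / sd * (1 + scale) + shift
    # attribute conditions are spatially aligned, so inject token-wise
    out = out + attr_emb @ w_attr               # (n, d)
    return out
```

The intuition behind treating the two modalities differently: identity is a global property of the whole clip, whereas pose/expression/lighting vary per frame and per region, so a single injection path would conflate them.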
Problem

Research questions and friction points this paper is trying to address.

Video Face Swapping
Identity Preservation
Temporal Consistency
Attribute Preservation
Image-to-Video Gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
Video Face Swapping
Modality-Aware Conditioning
Identity-Coherence Reinforcement Learning
Synthetic-to-Real Curriculum
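The page names a Synthetic-to-Real Curriculum but gives no schedule details. As a minimal hypothetical illustration of such a curriculum, a training loop could anneal the probability of sampling real footage (versus synthetic SyncID-Pipe pairs) over training; the linear schedule below is an assumption, not the paper's.

```python
def curriculum_real_ratio(step: int, total_steps: int,
                          start: float = 0.0, end: float = 1.0) -> float:
    """Fraction of real (vs. synthetic) samples drawn at a given step.
    A linear anneal from synthetic-heavy to real-heavy data; purely
    illustrative -- the actual DreamID-V schedule is not disclosed here."""
    t = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * t
```

A dataloader would then draw a real clip with probability `curriculum_real_ratio(step, total_steps)` and a synthetic quadruplet otherwise, so early training relies on the explicit supervision of synthetic pairs and later training shifts toward real-world distribution.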
Xu Guo
Tsinghua University
Generative Models · Reinforcement Learning
Fulong Ye
ByteDance
Vision-Language Pretrain · Generative Models · Diffusion Models
Xinghui Li
Intelligent Creation Lab, ByteDance
Pengqi Tu
Intelligent Creation Lab, ByteDance
Pengze Zhang
Intelligent Creation Lab, ByteDance
Qichao Sun
Intelligent Creation Lab, ByteDance
Songtao Zhao
Intelligent Creation Lab, ByteDance
Xiangwang Hou
Department of EE, Tsinghua University
Wireless Federated Learning · Edge Intelligence · UAV/AUV Swarm
Qian He
ByteDance