DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly optimizing identity similarity, attribute fidelity—including expression, pose, and illumination—and temporal consistency in video face swapping. To this end, we propose DreamID-V, the first framework to introduce diffusion Transformers to this task, featuring a modality-aware conditional injection mechanism and an identity-anchored video synthesizer. We further design SyncID-Pipe, a data pipeline that generates bidirectional identity quadruplets, and employ a synthetic-to-real curriculum learning strategy combined with identity-consistency reinforcement learning to significantly enhance both identity fidelity and temporal stability. Additionally, we release IDBench-V, the first comprehensive benchmark for video face swapping evaluation. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches across multiple metrics and exhibits strong generalization across diverse face swapping scenarios.

📝 Abstract
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video while meticulously preserving the original pose, expression, lighting, background, and dynamic information. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. To address this challenge, we propose a comprehensive framework that seamlessly transfers the strengths of Image Face Swapping (IFS) to the video domain. We first introduce a novel data pipeline, SyncID-Pipe, that pre-trains an Identity-Anchored Video Synthesizer and combines it with IFS models to construct bidirectional ID quadruplets for explicit supervision. Building upon this paired data, we propose the first Diffusion Transformer-based framework, DreamID-V, employing a core Modality-Aware Conditioning module to discriminatively inject multi-modal conditions. Meanwhile, we propose a Synthetic-to-Real Curriculum mechanism and an Identity-Coherence Reinforcement Learning strategy to enhance visual realism and identity consistency under challenging scenarios. To address the issue of limited benchmarks, we introduce IDBench-V, a comprehensive benchmark encompassing diverse scenes. Extensive experiments demonstrate that DreamID-V outperforms state-of-the-art methods and exhibits exceptional versatility: it can be seamlessly adapted to various swap-related tasks.
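The page does not describe the internals of the Modality-Aware Conditioning module, so the following is only a loose NumPy sketch of what "discriminatively injecting multi-modal conditions" could mean in a Transformer block: a global identity embedding modulates tokens via AdaLN-style scale/shift, while spatially aligned attribute conditions (pose, expression, lighting) are added per token. All shapes, weights, and the function itself are illustrative assumptions, not the paper's method.

```python
import numpy as np

def modality_aware_inject(tokens, id_emb, attr_emb, w_id, w_attr):
    """Toy sketch of modality-aware conditional injection (hypothetical
    simplification, NOT DreamID-V's actual module).
    - id_emb: global identity vector -> AdaLN-style scale/shift per channel
    - attr_emb: per-token attribute features -> additive injection
    tokens: (n, d), id_emb: (e,), attr_emb: (n, c),
    w_id: (e, 2*d), w_attr: (c, d) -- all shapes are assumptions."""
    scale_shift = id_emb @ w_id                 # (2*d,)
    d = tokens.shape[-1]
    scale, shift = scale_shift[:d], scale_shift[d:]
    # normalize tokens, then modulate globally by identity
    mu = tokens.mean(-1, keepdims=True)
    sd = tokens.std(-1, keepdims=True) + 1e-5
    out = (tokens - mu) / sd * (1 + scale) + shift
    # attribute conditions are spatially aligned, so inject token-wise
    out = out + attr_emb @ w_attr               # (n, d)
    return out
```

The intuition behind treating the two modalities differently: identity is a global property of the whole clip, whereas pose/expression/lighting vary per frame and per region, so a single injection path would conflate them.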
Problem

Research questions and friction points this paper is trying to address.

Video Face Swapping
Identity Preservation
Temporal Consistency
Attribute Preservation
Image-to-Video Gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
Video Face Swapping
Modality-Aware Conditioning
Identity-Coherence Reinforcement Learning
Synthetic-to-Real Curriculum
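The page names a Synthetic-to-Real Curriculum but gives no schedule details. As a minimal hypothetical illustration of such a curriculum, a training loop could anneal the probability of sampling real footage (versus synthetic SyncID-Pipe pairs) over training; the linear schedule below is an assumption, not the paper's.

```python
def curriculum_real_ratio(step: int, total_steps: int,
                          start: float = 0.0, end: float = 1.0) -> float:
    """Fraction of real (vs. synthetic) samples drawn at a given step.
    A linear anneal from synthetic-heavy to real-heavy data; purely
    illustrative -- the actual DreamID-V schedule is not disclosed here."""
    t = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * t
```

A dataloader would then draw a real clip with probability `curriculum_real_ratio(step, total_steps)` and a synthetic quadruplet otherwise, so early training relies on the explicit supervision of synthetic pairs and later training shifts toward real-world distribution.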
Xu Guo
Tsinghua University
Generative Models · Reinforcement Learning
Fulong Ye
ByteDance
Vision-Language Pretrain · Generative Models · Diffusion Models
Xinghui Li
Intelligent Creation Lab, ByteDance
Pengqi Tu
Intelligent Creation Lab, ByteDance
Pengze Zhang
Intelligent Creation Lab, ByteDance
Qichao Sun
Intelligent Creation Lab, ByteDance
Songtao Zhao
Intelligent Creation Lab, ByteDance
Xiangwang Hou
Department of EE, Tsinghua University
Wireless Federated Learning · Edge Intelligence · UAV/AUV Swarm
Qian He
ByteDance