Identity-Preserving Video Dubbing Using Motion Warping

📅 2025-01-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing lip-sync video dubbing methods struggle to simultaneously achieve high lip motion synchronization accuracy and faithful preservation of speaker identity characteristics—such as skin texture and lip morphology—leading to significant degradation in visual detail fidelity and identity consistency. To address this, we propose the first Transformer-based audio–reference image dynamic cross-modal alignment framework. Our method integrates differentiable motion warping, multi-scale texture enhancement, and occlusion-aware inpainting modules to jointly optimize precise lip motion synchronization and fine-grained identity preservation. Evaluated on multiple standard benchmarks, our approach achieves state-of-the-art performance: a 32% reduction in lip-sync error (LSE), a 41% improvement in identity similarity (ID-Sim), and substantial gains in perceptual realism and user preference over prior methods.

Technology Category

Application Category

📝 Abstract
Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. Although existing methods can accurately generate mouth shapes driven by audio, they often fail to preserve identity-specific features, largely because they do not effectively capture the nuanced interplay between audio cues and the visual attributes of reference identity . As a result, the generated outputs frequently lack fidelity in reproducing the unique textural and structural details of the reference identity. To address these limitations, we propose IPTalker, a novel and robust framework for video dubbing that achieves seamless alignment between driving audio and reference identity while ensuring both lip-sync accuracy and high-fidelity identity preservation. At the core of IPTalker is a transformer-based alignment mechanism designed to dynamically capture and model the correspondence between audio features and reference images, thereby enabling precise, identity-aware audio-visual integration. Building on this alignment, a motion warping strategy further refines the results by spatially deforming reference images to match the target audio-driven configuration. A dedicated refinement process then mitigates occlusion artifacts and enhances the preservation of fine-grained textures, such as mouth details and skin features. Extensive qualitative and quantitative evaluations demonstrate that IPTalker consistently outperforms existing approaches in terms of realism, lip synchronization, and identity retention, establishing a new state of the art for high-quality, identity-consistent video dubbing.
Problem

Research questions and friction points this paper is trying to address.

Video Dubbing
Identity Preservation
Lip Synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

IPTalker
audio-visual synchronization
video dubbing realism
🔎 Similar Papers
R
Runzhen Liu
Department of Computer Science and Engineering, South China University of Technology
Qinjie Lin
Qinjie Lin
PhD in computer science, Northwestern University
Robotics systemRobot LearningReinforcement Learning
Y
Yunfei Liu
Vistring Lab, IDEA
Lijian Lin
Lijian Lin
Tencent ARC Lab
Computer VisionVisual Tracking,Video Object Detection
Y
Ye Zhu
Vistring Lab, IDEA
Y
Yu Li
Vistring Lab, IDEA
Chuhua Xian
Chuhua Xian
South China University of Technology
Computer Graphics