InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional lip-sync video dubbing is limited to mouth-region editing, causing desynchronization between facial expressions and body motions and degrading immersion. This paper proposes a novel sparse-frame video dubbing paradigm that synthesizes audio-driven full-body motion by retaining only a few key reference frames. Our contributions include: (1) formally defining the sparse-frame dubbing task for the first time; (2) designing an adaptive conditional control mechanism that jointly models temporal context and performs fine-grained reference frame localization to ensure long-term consistency in identity, signature gestures, and camera motion; and (3) developing a streaming audio-driven architecture with optimized sampling strategies to enhance controllability and long-sequence stability of image-to-video diffusion models. Extensive experiments on HDTF, CelebV-HQ, and EMTD demonstrate state-of-the-art performance, with significant improvements in visual realism, emotional alignment, and full-body synchronization.

📝 Abstract
Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for seamless inter-chunk transitions and incorporates a simple yet effective sampling strategy that optimizes control strength via fine-grained reference frame positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance. Quantitative metrics confirm superior visual realism, emotional coherence, and full-body motion synchronization.
Problem

Research questions and friction points this paper is trying to address.

Full-body motion synchronization in video dubbing
Adaptive conditioning for sparse-frame video generation
Infinite-length audio-driven human animation generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse-frame video dubbing paradigm
Streaming audio-driven generator architecture
Fine-grained reference frame positioning strategy
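The streaming idea described above can be illustrated with a toy sketch: video is generated chunk by chunk, each chunk conditioned on the last few frames of the previous chunk (the temporal context) plus a reference frame, so arbitrarily long sequences stay consistent. The function names and the scalar "frames" below are illustrative assumptions, not the authors' implementation.

```python
def generate_chunk(context_frames, audio_chunk, reference_frame):
    """Toy stand-in for one diffusion pass: each output frame blends the
    previous frame, the reference frame, and the audio signal."""
    frames = []
    prev = context_frames[-1]
    for a in audio_chunk:
        nxt = 0.5 * prev + 0.3 * reference_frame + 0.2 * a
        frames.append(nxt)
        prev = nxt
    return frames

def streaming_dub(audio, reference_frame, chunk_len=8, context_len=2):
    """Generate an arbitrarily long sequence chunk by chunk, carrying the
    last `context_len` frames forward for seamless inter-chunk transitions."""
    video = [reference_frame] * context_len  # bootstrap context
    for start in range(0, len(audio), chunk_len):
        audio_chunk = audio[start:start + chunk_len]
        context = video[-context_len:]
        video.extend(generate_chunk(context, audio_chunk, reference_frame))
    return video[context_len:]  # drop the bootstrap frames

audio = [float(i % 4) for i in range(20)]
frames = streaming_dub(audio, reference_frame=1.0)
assert len(frames) == len(audio)
```

The key design point is that only the overlap (context) frames tie consecutive chunks together, so memory cost stays constant regardless of total length; the reference frame re-injected into every chunk is what preserves identity over long sequences.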
👥 Authors
Shaoshu Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences; New Laboratory of Pattern Recognition (NLPR), CASIA; Meituan
Zhe Kong
Sun Yat-sen University
Generative model · Image and video synthesis
Feng Gao
Meituan
Meng Cheng
Wuhan University of Science and Technology
Li-ion Battery · Solid-state Electrolyte · 3D Printing · In Situ Electron Microscopy
Xiangyu Liu
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; Meituan
Yong Zhang
Meituan
Zhuoliang Kang
Meituan
Wenhan Luo
Associate Professor, HKUST
Creative AI · Generative Model · Computer Vision · Machine Learning
Xunliang Cai
Meituan
Ran He
School of Artificial Intelligence, University of Chinese Academy of Sciences; New Laboratory of Pattern Recognition (NLPR), CASIA
Xiaoming Wei
Meituan
computer vision · machine learning