Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven human animation methods are largely confined to single-speaker scenarios: they struggle with multi-stream audio inputs, suffer from incorrect audio-person binding, and show weak instruction-following capabilities. This paper defines the novel task of multi-person conversational video generation and introduces MultiTalk, a framework built on a multi-stream audio-injected video diffusion model. It proposes Label Rotary Position Embedding (L-RoPE) to bind each audio stream to its corresponding person in the generated video, and relies on partial parameter training and multi-task training to preserve the instruction-following ability of the base model. MultiTalk achieves strong performance on talking-head, talking-body, and multi-person conversational benchmarks, improving lip-sync accuracy, visual quality, and instruction following.

📝 Abstract
Audio-driven human animation methods, such as talking head and talking body generation, have made remarkable progress in generating synchronized facial movements and videos of appealing visual quality. However, existing methods primarily focus on single-human animation and struggle with multi-stream audio inputs, which leads to incorrect binding between audio and persons. Additionally, they exhibit limited instruction-following capabilities. To solve these problems, we propose a novel task, Multi-Person Conversational Video Generation, and introduce a new framework, MultiTalk, to address the challenges of multi-person generation. Specifically, for audio injection, we investigate several schemes and propose the Label Rotary Position Embedding (L-RoPE) method to resolve the audio-person binding problem. Furthermore, during training, we observe that partial parameter training and multi-task training are crucial for preserving the instruction-following ability of the base model. MultiTalk achieves superior performance compared to other methods on several datasets, including talking head, talking body, and multi-person datasets, demonstrating the powerful generation capabilities of our approach.
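To make the binding idea concrete, below is a minimal, hypothetical sketch of how label-based rotary offsets could tie each audio stream to the video-latent region of its speaker during audio cross-attention: tokens belonging to a person's region and that person's audio keys share the same label position, so their relative rotary phase is zero and their attention scores stay high. The label values, region-mask interface, tensor shapes, and helper names are illustrative assumptions, not the paper's exact design.

```python
import torch

def rope_freqs(dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for an even head dimension `dim`.
    return 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

def apply_rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    # Rotate feature pairs in `x` (..., seq, dim) by angles derived from `positions` (..., seq).
    freqs = rope_freqs(x.shape[-1]).to(x.device)            # (dim/2,)
    angles = positions.to(x.device)[..., None] * freqs      # (..., seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def label_positions(person_mask: torch.Tensor, labels: dict, bg_label: float) -> torch.Tensor:
    # Map each video-latent token to the label of the person occupying it (background otherwise).
    pos = torch.full(person_mask.shape, bg_label)
    for person_id, label in labels.items():
        pos[person_mask == person_id] = label
    return pos

# Usage sketch: video queries take label positions from per-person region masks,
# and each audio stream's keys take the matching person's label, so attention
# concentrates on the correct speaker's region.
labels = {1: 0.0, 2: 20.0}                      # illustrative per-person label offsets
video_q = torch.randn(1, 64, 64)                # (batch, video-latent tokens, head dim)
audio_k_p1 = torch.randn(1, 32, 64)             # keys from person 1's audio stream
person_mask = torch.randint(0, 3, (1, 64))      # 0 = background, 1 / 2 = person regions
q_rot = apply_rope(video_q, label_positions(person_mask, labels, bg_label=40.0))
k_rot = apply_rope(audio_k_p1, torch.full((1, 32), labels[1]))
attn = torch.softmax(q_rot @ k_rot.transpose(-1, -2) / 64 ** 0.5, dim=-1)
```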
Problem

Research questions and friction points this paper is trying to address.

Address incorrect audio-person binding in multi-stream inputs
Enhance instruction-following in multi-person video generation
Generate synchronized multi-person conversational videos from audio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Label Rotary Position Embedding (L-RoPE) for audio-person binding
Partial parameter training to preserve instruction following (sketched after this list)
Multi-task training to retain the base model's capabilities
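As a rough illustration of the partial parameter training idea referenced above, the sketch below freezes a pretrained video diffusion backbone and updates only newly added audio-conditioning layers. The module keyword `audio_cross_attn` and the optimizer settings are hypothetical placeholders, not the paper's actual configuration.

```python
import torch

def select_trainable(model: torch.nn.Module, keyword: str = "audio_cross_attn"):
    # Freeze every parameter except those whose name contains `keyword`,
    # and return the trainable subset for the optimizer.
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Example: only the audio cross-attention layers of a (hypothetical) backbone are updated.
# optimizer = torch.optim.AdamW(select_trainable(video_diffusion_model), lr=1e-5)
```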
Authors
Zhe Kong, Sun Yat-sen University (generative models; image and video synthesis)
Feng Gao, Meituan
Yong Zhang, Meituan
Zhuoliang Kang, Meituan
Xiaoming Wei, Meituan (computer vision; machine learning)
Xunliang Cai, Meituan
Guanying Chen, Shenzhen Campus of Sun Yat-sen University
Wenhan Luo, Associate Professor, HKUST (creative AI; generative models; computer vision; machine learning)