AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current audio-driven multi-speaker talking-head video generation faces two key bottlenecks: the prohibitively high cost of acquiring high-quality, interactive multi-person data, and the inherent difficulty of modeling coordinated, identity-aware behavior and natural inter-speaker dynamics. To address these challenges, we propose a scalable multi-stream diffusion Transformer framework featuring a novel identity-aware attention mechanism and an iterative identity-audio pair processing paradigm. Our method requires only self-supervised pretraining on single-speaker videos, followed by fine-tuning on a small number of real multi-speaker clips. It supports arbitrary numbers of speakers and enables fine-grained control over inter-speaker interactions. Evaluated with our newly constructed benchmark dataset and metrics, our approach achieves significant improvements over state-of-the-art methods in lip-sync accuracy, visual fidelity, and interaction naturalness, making it the first work to enable high-quality, scalable multi-speaker talking-head synthesis in low-data regimes.

📝 Abstract
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high cost of collecting diverse multi-person data and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework built on an extensible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Moreover, whereas training multi-person generative models ordinarily demands massive multi-person data, our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data cost and identity scalability.
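
The paper's implementation details are not reproduced on this page, but the abstract's description (iterating over identity-audio pairs inside the Diffusion Transformer's attention block, with weights shared across identities) suggests a structure along the following lines. This is a minimal illustrative sketch, assuming per-identity spatial masks and one audio stream per speaker; all names, shapes, and the masking scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class IdentityAwareCrossAttention(nn.Module):
    """One cross-attention block iterated over identity-audio pairs.

    Weights are shared across identities, so the same block can drive
    any number of speakers (hypothetical sketch, not the paper's code).
    """

    def __init__(self, dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(
            dim, num_heads, kdim=audio_dim, vdim=audio_dim, batch_first=True
        )

    def forward(self, video_tokens, audio_streams, identity_masks):
        # video_tokens:   (B, N, dim)  latent video tokens from the DiT
        # audio_streams:  list of (B, T, audio_dim), one stream per identity
        # identity_masks: list of (B, N) bool, True where a token belongs
        #                 to that identity's spatial region
        out = video_tokens
        for audio, mask in zip(audio_streams, identity_masks):
            attended, _ = self.attn(self.norm(out), audio, audio)
            # Route each audio stream only into its own identity's tokens.
            out = out + attended * mask.unsqueeze(-1)
        return out

# Toy usage: two speakers sharing one frame, split left/right.
block = IdentityAwareCrossAttention(dim=64, audio_dim=32)
tokens = torch.randn(1, 100, 64)
audios = [torch.randn(1, 20, 32), torch.randn(1, 20, 32)]
masks = [torch.zeros(1, 100, dtype=torch.bool) for _ in range(2)]
masks[0][:, :50] = True
masks[1][:, 50:] = True
out = block(tokens, audios, masks)  # (1, 100, 64)
```

Because the loop shares one set of attention weights across all identity-audio pairs, adding a speaker only adds another iteration, which is one plausible reading of "arbitrary scaling of drivable identities."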
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized multi-person talking videos from audio inputs
Overcoming high costs of diverse multi-person video data collection
Achieving coherent interactivity between multiple driven identities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extensible multi-stream architecture for identity scaling
Identity-aware attention mechanism in Diffusion Transformer
Training pipeline learns multi-person speaking patterns from single-person videos, refining interactivity with only a few real multi-person clips (see the sketch after this list)
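
Below is a hedged sketch of how such a two-stage recipe might look in code: stage 1 pretrains on abundant single-person clips (one identity-audio pair per sample), and stage 2 briefly fine-tunes on a few multi-person clips to refine interactivity. The denoiser, loss, synthetic data, and learning rates are all illustrative placeholders, not the authors' setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in for the full multi-stream DiT; a single linear
# layer keeps the sketch runnable.
class TinyDenoiser(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, noisy_latents, audio_streams):
        return self.net(noisy_latents)  # audio conditioning elided

def denoising_loss(model, latents, audio_streams):
    # Standard epsilon-prediction objective with a linear noising schedule.
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], 1, 1)
    noisy = (1 - t) * latents + t * noise
    return F.mse_loss(model(noisy, audio_streams), noise)

model = TinyDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: self-supervised pretraining on single-person videos
# (synthetic tensors stand in for encoded video and audio here).
for _ in range(100):
    latents = torch.randn(4, 16, 64)   # (batch, tokens, dim)
    audio = [torch.randn(4, 8, 64)]    # one identity-audio pair
    loss = denoising_loss(model, latents, audio)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: interactivity refinement on a few real multi-person clips,
# at a reduced learning rate (an assumption for fine-tuning).
for g in opt.param_groups:
    g["lr"] = 1e-5
for _ in range(10):
    latents = torch.randn(4, 16, 64)
    audio = [torch.randn(4, 8, 64) for _ in range(2)]  # two speakers
    loss = denoising_loss(model, latents, audio)
    opt.zero_grad()
    loss.backward()
    opt.step()
```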
👥 Authors

Zhizhou Zhong
PhD student @ HKUST
face recognition, biometrics, AIGC

Yicheng Ji
Zhejiang University

Zhe Kong
Sun Yat-sen University
Generative model, image and video synthesis

Yiying Liu
Video Rebirth

Jiarui Wang
Video Rebirth

Jiasun Feng
Video Rebirth

Lupeng Liu
Video Rebirth

Xiangyi Wang
Video Rebirth

Yanjia Li
Video Rebirth

Yuqing She
Video Rebirth

Ying Qin
Beijing Jiaotong University

Huan Li
Zhejiang University

Shuiyang Mao
Video Rebirth

Wei Liu
Video Rebirth

Wenhan Luo
Associate Professor, HKUST
Creative AI, generative model, computer vision, machine learning