🤖 AI Summary
Current audio-driven multi-speaker talking-head video generation faces two key bottlenecks: the prohibitively high cost of acquiring high-quality, interactive multi-person data, and the inherent difficulty of modeling coordinated, identity-aware behavior and natural inter-speaker dynamics. To address these challenges, we propose a scalable multi-stream diffusion Transformer framework featuring a novel identity-aware attention mechanism and an iterative identity-audio pair processing paradigm. Our method requires only self-supervised pretraining on single-speaker videos, followed by fine-tuning on a small number of real multi-speaker clips. It supports arbitrary numbers of speakers and enables fine-grained control over inter-speaker interactions. Evaluated with our newly constructed benchmark dataset and metrics, our approach achieves significant improvements over state-of-the-art methods in lip-sync accuracy, visual fidelity, and interaction naturalness, marking the first work to enable high-quality, scalable multi-speaker talking-head synthesis under low-data regimes.
📝 Abstract
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often struggle with the high cost of collecting diverse multi-person data and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework built on an extensible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing the set of drivable identities to scale arbitrarily. Moreover, whereas training multi-person generative models typically demands massive multi-person data, our training pipeline relies solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data cost and identity scalability.
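To make the core idea more concrete, below is a minimal PyTorch-style sketch of how an identity-aware attention block might iterate over identity-audio pairs inside a Diffusion Transformer stream. The abstract does not specify an implementation, so the module name `IdentityAwareAttention`, the per-speaker `audio_feats` and `id_masks` inputs, and the masking scheme are illustrative assumptions rather than the authors' actual design.

```python
# Minimal sketch (an assumption, not the authors' code) of iterating over
# identity-audio pairs with identity-aware cross-attention in a DiT block.
import torch
import torch.nn as nn


class IdentityAwareAttention(nn.Module):
    """Cross-attend each identity's video tokens to that identity's audio stream."""

    def __init__(self, dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, audio_feats, id_masks):
        # x:           (B, N, dim)  video latent tokens of the DiT stream
        # audio_feats: list of (B, T, audio_dim) tensors, one per speaker
        # id_masks:    list of (B, N) boolean masks, one spatial region per speaker
        out = x
        for audio, mask in zip(audio_feats, id_masks):
            # Cross-attend video tokens to this speaker's audio, then keep the
            # update only inside that speaker's masked region, so each identity
            # is driven solely by its paired audio stream.
            attn_out, _ = self.attn(self.norm(x), audio, audio)
            out = out + attn_out * mask.unsqueeze(-1).to(x.dtype)
        return out


if __name__ == "__main__":
    B, N, T, dim, audio_dim = 1, 256, 32, 512, 768
    block = IdentityAwareAttention(dim, audio_dim)
    x = torch.randn(B, N, dim)
    audio_feats = [torch.randn(B, T, audio_dim) for _ in range(3)]  # 3 speakers
    id_masks = [torch.rand(B, N) > 0.7 for _ in range(3)]
    print(block(x, audio_feats, id_masks).shape)  # torch.Size([1, 256, 512])
```

Because the block simply loops over however many identity-audio pairs it receives, the same weights can in principle drive an arbitrary number of speakers, which is the scalability property the abstract emphasizes.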