🤖 AI Summary
Current audio-driven multi-speaker talking-head video generation faces two key bottlenecks: the prohibitively high cost of acquiring high-quality, interactive multi-person data, and the inherent difficulty of modeling coordinated, identity-aware behavior and natural inter-speaker dynamics. To address these challenges, we propose a scalable multi-stream diffusion Transformer framework featuring a novel identity-aware attention mechanism and an iterative identity-audio pair processing paradigm. Our method requires only self-supervised pretraining on single-speaker videos, followed by fine-tuning on a small number of real multi-speaker clips. It supports arbitrary numbers of speakers and enables fine-grained control over inter-speaker interactions. Evaluated with our newly constructed benchmark dataset and metrics, our approach achieves significant improvements over state-of-the-art methods in lip-sync accuracy, visual fidelity, and interaction naturalness, marking the first work to enable high-quality, scalable multi-speaker talking-head synthesis under low-data regimes.
📝 Abstract
Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often struggle with the high cost of collecting diverse multi-person data and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework built on an extensible multi-stream processing architecture. Specifically, we extend the Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing the set of drivable identities to scale arbitrarily. Moreover, whereas training multi-person generative models typically demands massive multi-person data, our training pipeline relies solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data cost and identity scalability.
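To make the core idea more concrete, below is a minimal PyTorch-style sketch of how an identity-aware attention block might iterate over identity-audio pairs inside a Diffusion Transformer stream. The abstract does not specify an implementation, so the module name `IdentityAwareAttention`, the per-speaker `audio_feats` and `id_masks` inputs, and the masking scheme are illustrative assumptions rather than the authors' actual design.

```python
# Minimal sketch (an assumption, not the authors' code) of iterating over
# identity-audio pairs with identity-aware cross-attention in a DiT block.
import torch
import torch.nn as nn


class IdentityAwareAttention(nn.Module):
    """Cross-attend each identity's video tokens to that identity's audio stream."""

    def __init__(self, dim: int, audio_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, audio_feats, id_masks):
        # x:           (B, N, dim)  video latent tokens of the DiT stream
        # audio_feats: list of (B, T, audio_dim) tensors, one per speaker
        # id_masks:    list of (B, N) boolean masks, one spatial region per speaker
        out = x
        for audio, mask in zip(audio_feats, id_masks):
            # Cross-attend video tokens to this speaker's audio, then keep the
            # update only inside that speaker's masked region, so each identity
            # is driven solely by its paired audio stream.
            attn_out, _ = self.attn(self.norm(x), audio, audio)
            out = out + attn_out * mask.unsqueeze(-1).to(x.dtype)
        return out


if __name__ == "__main__":
    B, N, T, dim, audio_dim = 1, 256, 32, 512, 768
    block = IdentityAwareAttention(dim, audio_dim)
    x = torch.randn(B, N, dim)
    audio_feats = [torch.randn(B, T, audio_dim) for _ in range(3)]  # 3 speakers
    id_masks = [torch.rand(B, N) > 0.7 for _ in range(3)]
    print(block(x, audio_feats, id_masks).shape)  # torch.Size([1, 256, 512])
```

Because the block simply loops over however many identity-audio pairs it receives, the same weights can in principle drive an arbitrary number of speakers, which is the scalability property the abstract emphasizes.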