Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven talking-head methods are limited to single-speaker scenarios and struggle with synthesizing dialogue videos featuring multiple speakers co-occurring in the same physical space. This work systematically addresses two key challenges: precise audio-to-speaker binding and the absence of multi-speaker co-occurrence datasets. We propose a fine-grained Embedding Router integrating 3D-mask embedding routing with geometric prior loss; construct the first open-source dataset and processing pipeline dedicated to multi-speaker co-occurring dialogue, establishing a two-speaker dialogue generation benchmark; and extend the MM-DiT architecture with dynamic mask routing, mask refinement, and temporal smoothing loss to achieve frame-level speaker control and lip-sync accuracy. Evaluated on our curated dataset, our method significantly outperforms state-of-the-art approaches, demonstrating high-fidelity audio-speaker alignment, sharp facial details, and strong temporal consistency.
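To make the routing idea concrete, below is a minimal sketch of how per-frame, per-character masks could gate audio conditioning onto spatial video tokens. The paper's actual MM-DiT integration is not detailed in this summary, so the tensor layout and the function `route_audio_by_mask` are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of frame-wise, mask-based audio routing.
# Shapes and names are assumptions; the paper's real router is not shown here.
import torch

def route_audio_by_mask(audio_embs: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """
    audio_embs: (B, C, F, D)  one audio embedding per character per frame
    masks:      (B, C, F, T)  soft per-character masks over T spatial tokens, in [0, 1]
    Returns (B, F, T, D): audio conditioning aligned with the video token grid,
    so each spatial token receives only the audio of the character covering it.
    """
    # Token t in frame f gets sum_c masks[c, f, t] * audio_embs[c, f].
    routed = torch.einsum('bcft,bcfd->bftd', masks, audio_embs)
    # Normalize where character masks overlap; background tokens (mask sum ~ 0)
    # end up with near-zero audio conditioning.
    denom = masks.sum(dim=1).clamp(min=1e-6).unsqueeze(-1)  # (B, F, T, 1)
    return routed / denom
```

Keeping the masks soft preserves differentiability, and making them frame-wise is what would enable the frame-level speaker control the summary describes.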

📝 Abstract
Recent years have witnessed remarkable advances in audio-driven talking-head generation. However, existing approaches predominantly focus on single-character scenarios. While some methods can create separate conversation videos between two individuals, the critical challenge of generating unified conversation videos with multiple physically co-present characters sharing the same spatial environment remains largely unaddressed. This setting presents two key challenges: audio-to-character correspondence control and the lack of suitable datasets featuring multi-character talking videos within the same scene. To address these challenges, we introduce Bind-Your-Avatar, an MM-DiT-based model specifically designed for multi-talking-character video generation in the same scene. Specifically, we propose (1) a novel framework incorporating a fine-grained Embedding Router that binds "who" and "speaks what" together to address audio-to-character correspondence control; (2) two methods for implementing a 3D-mask embedding router that enables frame-wise, fine-grained control of individual characters, with distinct loss functions based on observed geometric priors and a mask refinement strategy to enhance the accuracy and temporal smoothness of the predicted masks; (3) the first dataset, to the best of our knowledge, specifically constructed for multi-talking-character video generation, accompanied by an open-source data processing pipeline; and (4) a benchmark for dual-talking-character video generation, with extensive experiments demonstrating superior performance over multiple state-of-the-art methods.
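As one concrete reading of the temporal-smoothness idea above, the snippet below penalizes frame-to-frame changes in the predicted masks. The paper's exact loss functions (including the geometric-prior term) are not given in this abstract, so `temporal_smoothness_loss` is a hedged guess at a plausible formulation, not the authors' definition.

```python
# A plausible temporal smoothing term for predicted per-character masks.
# Purely illustrative; the paper's actual losses may differ.
import torch

def temporal_smoothness_loss(masks: torch.Tensor) -> torch.Tensor:
    """
    masks: (B, C, F, H, W) soft masks per character per frame.
    Penalizes consecutive-frame differences so each character's mask
    drifts smoothly instead of flickering.
    """
    diff = masks[:, :, 1:] - masks[:, :, :-1]  # difference between adjacent frames
    return diff.pow(2).mean()
```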
Problem

Research questions and friction points this paper is trying to address.

Generating unified multi-character videos in shared environments
Controlling audio-to-character correspondence in group conversations
Lack of datasets for multi-talking-character scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

MM-DiT-based model for multi-character video generation
3D-mask embedding router for fine-grained control
First dataset for multi-talking-character video generation
👥 Authors
Yubo Huang
University of Science and Technology of China

Weiqiang Wang
Monash University

Sirui Zhao
University of Science and Technology of China
Affective Computing, MLLM, HCI

Tong Xu
University of Science and Technology of China

Lin Liu
University of Science and Technology of China

Enhong Chen
University of Science and Technology of China
data mining, recommender systems, machine learning