🤖 AI Summary
Existing talking video generation research is largely confined to monologue scenarios or isolated facial animation, failing to model the bodily coordination and speech interaction inherent in realistic multi-person dialogues. To address this, we introduce MIT—the first large-scale dataset for multi-person interactive talking video generation—comprising 12 hours of high-resolution, naturally occurring dialogue videos with 2–4 participants, accompanied by fine-grained multi-body pose and speech interaction annotations. An automated acquisition and annotation pipeline ensures high data fidelity. We further propose CovOG, a baseline model designed to handle variable participant counts, featuring a Multi-Human Pose Encoder (MPE) and an Interactive Audio Driver (IAD) to explicitly model cross-speaker motion coupling and speech-responsive dynamics. Experiments demonstrate that CovOG significantly improves motion naturalness and lip-sync accuracy over prior methods. Together, the MIT dataset and CovOG establish a new foundation for research in multi-person interactive talking video generation.
📝 Abstract
Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each video featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
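The abstract notes that the MPE handles a varying number of speakers by aggregating individual pose embeddings. A minimal sketch of how such count-invariant aggregation could work is shown below; this is an illustrative assumption, not the paper's actual implementation, and all function names here are hypothetical.

```python
# Illustrative sketch (not CovOG's code): pooling a variable number of
# per-person pose embeddings into one fixed-size conditioning vector,
# so downstream layers see the same shape for 2, 3, or 4 speakers.
import numpy as np

def encode_multi_person_poses(pose_embeddings):
    """pose_embeddings: list of (D,) arrays, one per speaker.

    Mean-pools across speakers, so the output is a single (D,)
    vector regardless of how many participants are present.
    """
    stacked = np.stack(pose_embeddings, axis=0)  # (N, D)
    return stacked.mean(axis=0)                  # (D,)

# The same encoder accepts any participant count:
two_speakers = encode_multi_person_poses([np.ones(8), np.zeros(8)])
four_speakers = encode_multi_person_poses([np.ones(8)] * 4)
```

Mean pooling is just one simple permutation-invariant choice; an attention-based aggregator would serve the same purpose while letting the model weight speakers unequally.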