🤖 AI Summary
The field of audio-visual dyadic (two-person) interactive virtual human generation lacks large-scale, high-quality datasets. Method: This paper introduces SpeakerVid-5M, the first such dataset, comprising 5.2 million portrait video clips totaling 8,743 hours. It is structured along two dimensions: (i) categorization into four interaction-type branches (dialogue, single, listening, and multi-turn), and (ii) quality-based stratification into a large-scale pre-training subset and a curated subset for supervised fine-tuning. The dataset is constructed automatically via ASR, video segmentation, and multi-stage quality filtering. We also release VidChatBench, a dedicated evaluation benchmark. Contribution/Results: Leveraging SpeakerVid-5M, we train an autoregressive video-chat baseline capable of dialogue-aware response generation, providing a systematic foundation for diverse virtual human conversational synthesis. This work advances standardization, reproducibility, and scalability in the field.
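The summary names the construction steps (ASR, video segmentation, multi-stage quality filtering) only at a high level. A minimal sketch of what a staged filter over candidate clips could look like is below; the stage names, thresholds, and clip fields are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical multi-stage quality filter over segmented clips.
# All field names and thresholds here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Clip:
    duration_s: float          # clip length from video segmentation
    asr_confidence: float      # mean ASR word confidence for the transcript
    face_visible_ratio: float  # fraction of frames with a detected face

def stage_duration(clip: Clip) -> bool:
    return 2.0 <= clip.duration_s <= 60.0

def stage_asr(clip: Clip) -> bool:
    return clip.asr_confidence >= 0.8

def stage_face(clip: Clip) -> bool:
    return clip.face_visible_ratio >= 0.95

STAGES = [stage_duration, stage_asr, stage_face]

def filter_clips(clips):
    """Keep only clips that pass every stage, applied in order."""
    return [c for c in clips if all(stage(c) for stage in STAGES)]
```

Real pipelines of this kind typically log per-stage rejection counts, so that each filter's yield can be tuned independently.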
📝 Abstract
The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four branches based on the interaction scenario: dialogue, single, listening, and multi-turn. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for supervised fine-tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR) video chat baseline trained on this data, accompanied by a dedicated benchmark, VidChatBench, with its own metrics and test data for future work. Both the dataset and the corresponding data-processing code will be publicly released. Project page: https://dorniwang.github.io/SpeakerVid-5M/
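The abstract's two-axis organization (four interaction branches crossed with pre-training vs. SFT quality tiers) can be pictured as a simple index over clip records. The record layout and tier names below are assumptions for illustration, not the released data format.

```python
# Illustrative sketch (not the released format) of indexing clips along the
# dataset's two axes: interaction branch and quality tier.
from collections import defaultdict

BRANCHES = {"dialogue", "single", "listening", "multi_turn"}
TIERS = {"pretrain", "sft"}  # large-scale pre-training vs. curated SFT subset

def build_index(records):
    """Group clip ids by (branch, tier); records are (clip_id, branch, tier)."""
    index = defaultdict(list)
    for clip_id, branch, tier in records:
        if branch not in BRANCHES or tier not in TIERS:
            raise ValueError(f"unknown branch/tier: {branch}/{tier}")
        index[(branch, tier)].append(clip_id)
    return index
```

Keying on the (branch, tier) pair makes it straightforward to draw, say, only the dialogue-branch SFT clips for fine-tuning while streaming everything else for pre-training.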