TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven talking-head synthesis models exhibit severely limited generalization across racial, linguistic, and age groups, primarily due to insufficient scale, quality, and diversity of training data. Method: We introduce TalkVid—the first large-scale, high-quality talking-head video dataset systematically covering multiple ethnicities, languages, and age groups—and establish TalkVid-Bench, a hierarchical evaluation benchmark that uncovers previously masked subgroup performance disparities under conventional metrics. Our multi-stage automated curation pipeline jointly optimizes motion stability, facial detail fidelity, and aesthetic quality, supplemented by rigorous human verification. Contribution/Results: Models trained on TalkVid demonstrate significantly improved cross-dataset generalization and substantially more balanced performance across demographic subgroups. TalkVid and TalkVid-Bench collectively establish a foundational data and evaluation framework for developing fair, robust, and inclusive speech-driven visual generation models.

📝 Abstract
Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid
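The abstract's "stratified evaluation set ... balanced across key demographic and linguistic axes" amounts to sampling an equal number of clips per subgroup rather than drawing uniformly from the raw pool. A minimal sketch of that idea, assuming hypothetical field names (`ethnicity`, `language`, `age_group`) and group sizes not taken from the paper:

```python
# Hypothetical sketch of building a stratified benchmark like TalkVid-Bench:
# group clips by their combination of axis values, then draw the same number
# from every subgroup so no group dominates aggregate metrics.
# Field names and per-group counts are illustrative assumptions.
import random
from collections import Counter, defaultdict

def stratified_sample(clips, axes=("ethnicity", "language", "age_group"),
                      per_group=2, seed=0):
    """Return an evaluation set with up to `per_group` clips per subgroup."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for clip in clips:
        key = tuple(clip[a] for a in axes)  # one bucket per axis combination
        groups[key].append(clip)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Toy pool: two ethnicity subgroups, five candidate clips each.
clips = [{"id": f"{e}{i}", "ethnicity": e, "language": "en", "age_group": "adult"}
         for e in ("A", "B") for i in range(5)]
bench = stratified_sample(clips)
print(Counter(c["ethnicity"] for c in bench))  # equal counts per subgroup
```

The payoff of this balancing is exactly the point made above: per-subgroup metrics computed on such a set cannot be masked by one over-represented group.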
Problem

Research questions and friction points this paper is trying to address.

Lack of generalization in audio-driven talking head synthesis across diverse demographics
Limitations in existing training data regarding scale, quality, and diversity
Performance disparities obscured by traditional aggregate evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale diverse dataset for talking head synthesis
Automated pipeline filtering motion stability and quality
Stratified evaluation set balanced across demographics
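The curation pipeline described above applies quality gates in sequence (motion stability, aesthetic quality, facial detail). A minimal sketch of that staged-filter structure, where all score names and thresholds are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a multi-stage automated curation pipeline:
# each stage discards clips below a threshold before the next stage runs.
# Score keys and threshold values are assumptions for illustration.

def curate(clips, min_motion=0.8, min_aesthetic=0.5, min_detail=0.6):
    """Keep only clips that pass every quality gate, in order."""
    stages = [
        ("motion_stability", min_motion),  # e.g., low inter-frame jitter
        ("aesthetic", min_aesthetic),      # e.g., a learned aesthetic score
        ("facial_detail", min_detail),     # e.g., face sharpness/resolution
    ]
    kept = clips
    for key, threshold in stages:
        kept = [c for c in kept if c[key] >= threshold]
    return kept

clips = [
    {"id": "a", "motion_stability": 0.9, "aesthetic": 0.7, "facial_detail": 0.8},
    {"id": "b", "motion_stability": 0.6, "aesthetic": 0.9, "facial_detail": 0.9},
]
print([c["id"] for c in curate(clips)])  # clip "b" fails the motion gate
```

Ordering the cheap, high-rejection filters first is the usual design choice in such pipelines, since later (costlier) scorers then run on far fewer clips; the paper additionally validates the automated gates against human judgments.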
Shunian Chen
The Chinese University of Hong Kong, Shenzhen
Large Language Models, Multimodal Large Language Models, Agent
Hejin Huang
The Chinese University of Hong Kong, Shenzhen; Sun Yat-sen University
Yexin Liu
The Hong Kong University of Science and Technology
AIGC
Zihan Ye
The Chinese University of Hong Kong, Shenzhen
Pengcheng Chen
The Chinese University of Hong Kong, Shenzhen
Chenghao Zhu
University of Electronic Science and Technology of China
Michael Guan
The Chinese University of Hong Kong, Shenzhen
Rongsheng Wang
The Chinese University of Hong Kong, Shenzhen
Deep Learning
Junying Chen
The Chinese University of Hong Kong, Shenzhen
Guanbin Li
Sun Yat-sen University
Ser-Nam Lim
The Hong Kong University of Science and Technology
Harry Yang
HKUST
computer vision, machine learning
Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
large language models, natural language processing, information retrieval, applied machine learning