🤖 AI Summary
Existing audio-driven talking-head synthesis models generalize poorly across racial, linguistic, and age groups, primarily because their training data lack sufficient scale, quality, and diversity. Method: We introduce TalkVid, the first large-scale, high-quality talking-head video dataset to systematically cover multiple ethnicities, languages, and age groups, and establish TalkVid-Bench, a stratified evaluation benchmark that exposes subgroup performance disparities masked by conventional aggregate metrics. Our multi-stage automated curation pipeline jointly filters for motion stability, facial detail fidelity, and aesthetic quality, supplemented by rigorous human verification. Contribution/Results: Models trained on TalkVid show markedly better cross-dataset generalization and substantially more balanced performance across demographic subgroups. Together, TalkVid and TalkVid-Bench provide a foundational data and evaluation framework for developing fair, robust, and inclusive speech-driven visual generation models.
📝 Abstract
Audio-driven talking-head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they do not generalize to the full spectrum of human diversity in ethnicity, language, and age. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1,244 hours of video from 7,729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found at https://github.com/FreedomIntelligence/TalkVid
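To make the multi-stage filtering idea concrete, the minimal sketch below shows how per-clip quality scores could be thresholded stage by stage; the metric names, thresholds, and data structures are illustrative assumptions, not the actual TalkVid pipeline.

```python
# Hypothetical sketch of stage-wise clip curation; scores and thresholds are
# assumed for illustration and do not reflect the released pipeline.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ClipScores:
    clip_id: str
    motion_stability: float   # higher = steadier head/camera motion
    aesthetic_quality: float  # higher = better framing and lighting
    facial_detail: float      # higher = sharper facial region


def make_stage(metric: Callable[[ClipScores], float], threshold: float):
    """Build a filter stage that keeps clips whose metric clears the threshold."""
    return lambda clips: [c for c in clips if metric(c) >= threshold]


# Stages run sequentially, so each stage only sees survivors of the previous one.
stages = [
    make_stage(lambda c: c.motion_stability, 0.8),   # assumed threshold
    make_stage(lambda c: c.aesthetic_quality, 0.6),  # assumed threshold
    make_stage(lambda c: c.facial_detail, 0.7),      # assumed threshold
]


def curate(clips: List[ClipScores]) -> List[ClipScores]:
    for stage in stages:
        clips = stage(clips)
    return clips


if __name__ == "__main__":
    candidates = [
        ClipScores("clip_001", 0.92, 0.75, 0.81),
        ClipScores("clip_002", 0.55, 0.90, 0.88),  # dropped: unstable motion
    ]
    print([c.clip_id for c in curate(candidates)])  # -> ['clip_001']
```

In the paper's pipeline, clips that survive automated filtering are additionally checked against human judgments; the sketch only covers the automated thresholding step.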