AI Summary
Existing high-quality talking-face video datasets are small-scale and severely underrepresent Asian populations, leading to demographic bias and resolution limitations in generative models. To address this, we introduce FaceVid-1K, the first large-scale, multi-ethnic, high-resolution real-world talking-face video dataset, comprising over 1,000 precisely temporally aligned videos systematically curated and expanded from diverse public sources to cover Asian, European, African, and other ethnic groups. We propose a domain-specific pretraining paradigm for talking-face video generation that integrates text-conditioned, image-conditioned, and unconditional models (e.g., SVD, Tune-A-Video), supported by rigorous data cleaning, lip-sync-to-speech alignment, and multimodal annotation. Extensive experiments demonstrate substantial improvements over state-of-the-art baselines across multiple talking-face generation benchmarks. Both the FaceVid-1K dataset and the pretrained models are publicly released to foster standardized, reproducible research in the field.
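To make the curation step concrete, the sketch below shows one plausible form of clip-level filtering by resolution, duration, and an audio-visual sync score. The metadata fields, thresholds, and the SyncNet-style confidence are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical clip-filtering sketch for a talking-face curation pipeline.
# Field names and thresholds are assumptions chosen for illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class ClipMeta:
    path: str
    width: int
    height: int
    duration_s: float
    sync_confidence: float  # e.g., a SyncNet-style audio-visual sync score

def filter_clips(clips: List[ClipMeta],
                 min_side: int = 512,
                 min_duration_s: float = 2.0,
                 min_sync_conf: float = 3.0) -> List[ClipMeta]:
    """Keep clips that are high-resolution, long enough, and well lip-synced."""
    kept = []
    for c in clips:
        if min(c.width, c.height) < min_side:
            continue  # drop low-resolution footage
        if c.duration_s < min_duration_s:
            continue  # drop clips too short to train video models on
        if c.sync_confidence < min_sync_conf:
            continue  # drop clips where lip motion and speech are misaligned
        kept.append(c)
    return kept
```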
Abstract
Generating talking face videos from various conditions has recently become a highly popular research area within generative tasks. However, building a high-quality face video generation model requires a well-performing pre-trained backbone, a need that universal models fail to adequately meet. Most existing works rely on universal video or image generation models and optimize their control mechanisms, but they neglect the evident upper bound on video quality imposed by the limited capabilities of the backbones, which in turn stems from the lack of high-quality human face video datasets. In this work, we investigate the unsatisfactory results of related studies, gather and trim existing public talking face video datasets, and additionally collect and annotate a large-scale dataset, resulting in a comprehensive, high-quality multiracial face collection named FaceVid-1K. Using this dataset, we craft several effective pre-trained backbone models for face video generation. Specifically, we conduct experiments with several well-established video generation models, including text-to-video, image-to-video, and unconditional video generation, under various settings. We obtain the corresponding performance benchmarks and compare them with those of models trained on public datasets to demonstrate the superiority of our dataset. These experiments also allow us to investigate empirical strategies for crafting domain-specific video generation models under cost-effective settings. We will make our curated dataset, along with the pre-trained talking face video generation models, publicly available as a resource contribution, in the hope of advancing the research field.
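As a sense of how an image-to-video backbone of the kind discussed here (e.g., SVD) could be exercised once fine-tuned on FaceVid-1K, the minimal sketch below runs standard Stable Video Diffusion inference via diffusers. The face-specific checkpoint is hypothetical; the public base weights are used as a stand-in.

```python
# Minimal image-conditioned generation sketch with Stable Video Diffusion
# via diffusers. A face-tuned checkpoint id would replace the base model;
# no such id is assumed here.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # stand-in for a face-tuned backbone
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Condition on a single portrait frame (path is illustrative).
image = load_image("reference_face.png").resize((1024, 576))

frames = pipe(
    image,
    num_frames=25,
    decode_chunk_size=8,   # trades VRAM for decoding speed
    motion_bucket_id=127,  # controls how much motion the model adds
    noise_aug_strength=0.02,
).frames[0]

export_to_video(frames, "talking_face.mp4", fps=7)
```

Note that base SVD is conditioned only on the reference image; audio- or text-driven control of the kind discussed in the abstract would be layered on top of such a backbone.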