A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of high-fidelity, large-scale talking-head video datasets for real-time communication by introducing an open-source dataset of 847 near-original video clips (212 minutes total) from 805 participants, captured with 446 consumer-grade cameras. The videos are stored with lossless FFV1 encoding to preserve the camera-native signal and are accompanied by subjective mean opinion scores (MOS) and ten-dimensional perceptual quality annotations. For the first time, a benchmark subset spanning three conditions (original, background blur, and background replacement) is established to systematically evaluate the H.264, H.265, H.266, and AV1 codecs. The dataset offers five times the scale of the largest existing dataset in this domain, and the evaluation shows that H.266 achieves up to a 71.3% BD-rate reduction over H.264, providing a high-quality benchmark for video compression and enhancement research.
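The clips are archived losslessly rather than re-encoded. As a rough illustration, the sketch below shows how a camera-native capture can be wrapped in FFV1 with FFmpeg; the file names are hypothetical, and the paper's exact archival pipeline is not specified.

```python
# Minimal sketch: lossless FFV1 archiving with FFmpeg, invoked from Python.
# Assumes ffmpeg is on PATH; "capture.mkv" and "clip_ffv1.mkv" are
# hypothetical file names, not paths from the paper.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "capture.mkv",   # camera-native input (uncompressed or MJPEG)
        "-c:v", "ffv1",        # FFV1 lossless video codec
        "-level", "3",         # FFV1 version 3 (slice-based variant)
        "-c:a", "copy",        # pass any audio stream through unchanged
        "clip_ffv1.mkv",
    ],
    check=True,                # raise if ffmpeg exits with an error
)
```

Because FFV1 is lossless, decoding the archived file reproduces the stored frames bit-exactly, so the dataset retains the camera-native signal with no further generation loss.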
📝 Abstract
Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal, uncompressed (24.4%) or MJPEG-encoded (75.6%), without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to −71.3% (H.266) relative to H.264, with significant encoder × dataset (ηₚ² = .112) and encoder × content condition (ηₚ² = .149) interactions, demonstrating that both content type and background processing affect compression efficiency. The dataset offers 5× the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for training and benchmarking video compression and enhancement models in real-time communication.
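The headline −71.3% figure is a Bjøntegaard delta rate (BD-rate): the average bitrate change of one codec relative to another at matched quality. Below is a minimal sketch of the standard computation over VMAF; the rate/quality points are invented for illustration and are not taken from the paper.

```python
# Minimal sketch of BD-rate (Bjøntegaard delta rate) over VMAF, the metric
# behind the "-71.3% vs. H.264" result. All numbers here are hypothetical;
# the standard method fits a cubic to log-bitrate as a function of quality
# and integrates both fits over the overlapping quality range.
import numpy as np

def bd_rate(ref_rates, ref_quals, test_rates, test_quals):
    """Percent bitrate change of test vs. reference at equal quality.
    Negative values mean the test codec needs less bitrate (a saving)."""
    p_ref = np.polyfit(ref_quals, np.log(ref_rates), 3)
    p_test = np.polyfit(test_quals, np.log(test_rates), 3)
    lo = max(min(ref_quals), min(test_quals))   # overlapping quality range
    hi = min(max(ref_quals), max(test_quals))
    # Integrate each log-rate curve over [lo, hi], take the mean difference.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Hypothetical (bitrate in kbps, VMAF) points for four rate settings per codec:
h264 = ([1000, 2000, 4000, 8000], [70.0, 80.0, 88.0, 94.0])
h266 = ([400,  800, 1600, 3200], [71.0, 81.0, 89.0, 95.0])
print(f"BD-rate (H.266 vs. H.264): {bd_rate(*h264, *h266):+.1f}%")
```

On data like the above, the result is a large negative number, matching the sign convention in the abstract: more negative means a greater bitrate saving at the same perceptual quality.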
Problem

Research questions and friction points this paper is trying to address.

talking-head video
dataset
signal fidelity
video compression
real-time communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

near-raw video dataset
lossless signal fidelity
talking-head video
perceptual quality annotation
video compression benchmarking