🤖 AI Summary
Human children learn remarkably efficiently from very few samples, whereas contemporary AI models exhibit a substantial "data gap" relative to this capability, due in large part to the scarcity of large-scale, high-fidelity, first-person developmental experience data. Method: The authors release the BabyView dataset, the largest longitudinal, high-resolution (1080p/60fps) infant egocentric video dataset to date, spanning 493 hours across ages 6 months to 5 years and covering diverse home and preschool settings. Recordings are synchronized with gyroscope/accelerometer (IMU) streams and accompanied by gold-standard annotations for speech transcription, speaker diarization, and human pose estimation. Contribution/Results: The dataset establishes a real-world, multimodal, quantifiable benchmark of "human training data" for developmental cognitive modeling and enables direct cross-species comparisons of sample efficiency. Experiments show that state-of-the-art models fall well short of gold-standard human annotations, and that self-supervised language and vision models trained on this data transfer worse than models trained on curated datasets, especially in the visual domain, thereby defining a human-aligned standard for evaluating sample efficiency.
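To make the "synchronized with IMU streams" claim concrete, here is a minimal sketch of aligning high-rate gyroscope/accelerometer samples to video frame timestamps. The array layout, sample rates, and function name are illustrative assumptions, not the dataset's released format or the authors' tooling.

```python
import numpy as np

def align_imu_to_frames(imu_t, imu_xyz, frame_t):
    """Nearest-timestamp alignment of IMU samples to video frames.

    imu_t    : (N,) IMU sample timestamps in seconds (sorted)
    imu_xyz  : (N, 3) gyroscope or accelerometer readings
    frame_t  : (M,) video frame timestamps in seconds
    Returns  : (M, 3) one IMU reading per video frame
    """
    # For each frame time, locate the insertion point among IMU timestamps.
    idx = np.searchsorted(imu_t, frame_t)
    idx = np.clip(idx, 1, len(imu_t) - 1)
    # Pick whichever neighboring IMU sample is closer in time.
    left_closer = (frame_t - imu_t[idx - 1]) < (imu_t[idx] - frame_t)
    idx = np.where(left_closer, idx - 1, idx)
    return imu_xyz[idx]

# Toy example: a 200 Hz IMU stream aligned to a 60 fps video stream.
imu_t = np.arange(0.0, 10.0, 1 / 200)
imu_xyz = np.random.randn(len(imu_t), 3)
frame_t = np.arange(0.0, 10.0, 1 / 60)
per_frame_imu = align_imu_to_frames(imu_t, imu_xyz, frame_t)
print(per_frame_imu.shape)  # (600, 3): one reading per frame
```

Nearest-neighbor lookup is the simplest alignment policy; interpolation or windowed averaging are equally reasonable choices depending on the downstream model.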
📝 Abstract
Human children far exceed modern machine learning algorithms in their sample efficiency, achieving high performance in key domains with much less data than current models. This "data gap" is a key challenge both for building intelligent artificial systems and for understanding human development. Egocentric video capturing children's experience (their "training data") is a key ingredient for comparing humans and models and for developing algorithmic innovations to bridge this gap. Yet few such datasets are available, and extant data are low-resolution, have limited metadata, and, importantly, represent only a small set of children's experiences. Here, we provide the first release of the largest developmental egocentric video dataset to date, the BabyView dataset, recorded using a high-resolution camera with a large vertical field-of-view and gyroscope/accelerometer data. This 493-hour dataset includes egocentric videos from children spanning 6 months to 5 years of age, both in longitudinal, at-home contexts and in a preschool environment. We provide gold-standard annotations for the evaluation of speech transcription, speaker diarization, and human pose estimation, and evaluate models in each of these domains. We also train self-supervised language and vision models and evaluate their transfer to out-of-distribution tasks, including syntactic structure learning, object recognition, depth estimation, and image segmentation. Although performance in each domain scales with dataset size, overall performance is lower than when models are trained on curated datasets, especially in the visual domain. Our dataset thus stands as an open challenge for robust, humanlike AI systems: how can such systems achieve human levels of success on the same scale and distribution of training data as humans?
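As a concrete illustration of one evaluation axis described above, the sketch below scores a hypothetical ASR transcript against a gold-standard transcript using word error rate (WER). The example strings and normalization pipeline are illustrative assumptions; `jiwer` is a common open-source WER library, not necessarily the tooling the authors used.

```python
import jiwer

# Normalize casing and punctuation so WER reflects word choice,
# not surface formatting differences between annotator and model.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

reference = normalize("Look at the ball! Can you roll it to me?")   # gold-standard
hypothesis = normalize("look at the ball can you throw it to me")   # model output

# One substitution (roll -> throw) out of ten reference words: WER = 0.1.
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```

Speaker diarization and pose estimation would be scored analogously against the gold-standard annotations, using metrics such as diarization error rate and keypoint accuracy, respectively.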