BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) suffer from low training-data efficiency, and mainstream evaluation benchmarks are misaligned with developmental psychology: they are either oversimplified and narrow in scope, or tailored exclusively to large-scale pretraining.

Method: Inspired by infant visual cognitive development, we propose the first developmentally aligned VLM pretraining paradigm. Leveraging real infant datasets (e.g., SAYCam), we design child-directed synthetic augmentations, including object-centric framing, speech-rate modulation, and spatial cropping, to generate high-fidelity training data. We further introduce the first benchmark covering multi-dimensional infant-like visual reasoning.

Contribution/Results: Experiments show that a lightweight VLM trained via our BabyVLM framework achieves a 12.7% average accuracy gain on the new benchmark under equal data budgets, significantly outperforming baseline models and validating the joint optimization of developmental authenticity and task breadth.
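As a rough illustration of what such child-directed visual augmentations might look like, the sketch below implements object-centric framing and spatial cropping on a single image. The function names, the margin/scale parameters, and the assumption that object bounding boxes come from an off-the-shelf detector are all illustrative; this is not the authors' pipeline.

```python
# Hypothetical sketch of two child-directed image augmentations in the
# spirit of BabyVLM's described transformations. Requires Pillow.
import random
from PIL import Image


def object_centric_crop(img: Image.Image,
                        bbox: tuple[int, int, int, int],
                        margin: float = 0.2) -> Image.Image:
    """Frame the image around one object, mimicking an infant's
    object-centered view. `bbox` is (left, top, right, bottom) from
    any off-the-shelf detector (an assumption, not the paper's setup)."""
    left, top, right, bottom = bbox
    pad_w = int((right - left) * margin)
    pad_h = int((bottom - top) * margin)
    return img.crop((max(0, left - pad_w),
                     max(0, top - pad_h),
                     min(img.width, right + pad_w),
                     min(img.height, bottom + pad_h)))


def random_spatial_crop(img: Image.Image, scale: float = 0.7) -> Image.Image:
    """Take a random sub-view, approximating the narrow, shifting field
    of view seen in egocentric infant footage."""
    cw, ch = int(img.width * scale), int(img.height * scale)
    x = random.randint(0, img.width - cw)
    y = random.randint(0, img.height - ch)
    return img.crop((x, y, x + cw, y + ch))
```

Speech-rate modulation would act on the paired audio or transcript stream (e.g., time-stretching child-directed speech) and is omitted from this visual-only sketch.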

📝 Abstract
Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned: they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, more diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or on general-purpose data of comparable size. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.
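The comparison the abstract describes (SAYCam-only vs. general-purpose data of comparable size) implies an equal-data-budget protocol. A minimal sketch of such a protocol, assuming each corpus is a list of image-text pairs and that uniform subsampling is acceptable, might look like this; the dataset names and sampling scheme are assumptions for illustration:

```python
# Minimal equal-data-budget comparison: subsample every corpus to the
# same number of image-text pairs before pretraining, so observed gains
# reflect data quality rather than quantity.
import random


def equal_budget(corpora: dict[str, list], budget: int,
                 seed: int = 0) -> dict[str, list]:
    """Subsample each corpus to at most `budget` examples,
    without replacement, using a fixed seed for reproducibility."""
    rng = random.Random(seed)
    return {name: rng.sample(pairs, min(budget, len(pairs)))
            for name, pairs in corpora.items()}


# Usage (hypothetical variable names): match every corpus to the
# SAYCam pair count before training each model.
# corpora = {"saycam": saycam_pairs,
#            "synthetic": synthetic_pairs,
#            "general": general_pairs}
# matched = equal_budget(corpora, budget=len(saycam_pairs))
```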
Problem

Research questions and friction points this paper is trying to address.

Address misaligned evaluation benchmarks for infant-inspired VLMs
Overcome limitations of training solely on infant data
Develop data-efficient pretraining for vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developmentally inspired pretraining for VLMs
Synthetic dataset via child-directed transformations
Compact models generalize with curated data