🤖 AI Summary
Inspired by infant visual development, this work asks whether the staged "visual diet" of early human vision confers ecological advantages on machine perception. The authors propose CATDiet, a developmental self-supervised paradigm that trains on object-centric videos under constraints simulating infant vision: grayscale-to-color (C), blur-to-sharp acuity (A), and preserved temporal continuity (T). Building on this, CombDiet initializes standard self-supervised learning with CATDiet pretraining while preserving temporal continuity. Contributions are threefold: (1) a systematic integration of infant visual development principles into video-based self-supervised training; (2) a comprehensive evaluation benchmark spanning ten datasets (clean and corrupted image recognition, texture-shape cue conflict, silhouette recognition, depth-order classification, and the visual cliff paradigm); and (3) evidence that the resulting models are more robust to corruption and noise, improve depth perception, and exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and infant-like visual cliff behavior. Together, these results support developmental trajectories as a reverse-engineering framework for the emergence of robust visual intelligence.
📝 Abstract
Newborns perceive the world with low-acuity, color-degraded, and temporally continuous vision, which gradually sharpens as infants develop. To explore the ecological advantages of such staged "visual diets", we train self-supervised learning (SSL) models on object-centric videos under constraints that simulate infant vision: grayscale-to-color (C), blur-to-sharp (A), and preserved temporal continuity (T), collectively termed CATDiet. For evaluation, we establish a comprehensive benchmark across ten datasets, covering clean and corrupted image recognition, texture-shape cue conflict tests, silhouette recognition, depth-order classification, and the visual cliff paradigm. All CATDiet variants demonstrate enhanced robustness in object recognition, despite being trained solely on object-centric videos. Remarkably, models also exhibit biologically aligned developmental patterns, including neural plasticity changes mirroring synaptic density in macaque V1 and behaviors resembling infants' visual cliff responses. Building on these insights, CombDiet initializes SSL with CATDiet before standard training while preserving temporal continuity. Trained on object-centric or head-mounted infant videos, CombDiet outperforms standard SSL on both in-domain and out-of-domain object recognition and depth perception. Together, these results suggest that the developmental progression of early infant visual experience offers a powerful reverse-engineering framework for understanding the emergence of robust visual intelligence in machines. All code, data, and models will be publicly released.
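The C and A constraints amount to an augmentation curriculum whose strength decays with training progress. The sketch below is a minimal illustration, not the paper's released implementation: the linear schedules, the box-blur stand-in for Gaussian blur, and all function names (`catdiet_schedule`, `apply_diet`) are assumptions made for clarity.

```python
import numpy as np

def catdiet_schedule(progress, max_sigma=4.0):
    """Map training progress in [0, 1] to diet parameters.

    Returns (saturation, blur_sigma): saturation ramps 0 -> 1 over the
    first half of training (grayscale-to-color), blur decays linearly
    to zero (blur-to-sharp). Both schedules are illustrative choices.
    """
    saturation = min(1.0, progress / 0.5)
    blur_sigma = max_sigma * (1.0 - progress)
    return saturation, blur_sigma

def apply_diet(frame, saturation, blur_sigma):
    """Apply the simulated infant-vision diet to one H x W x 3 frame in [0, 1]."""
    # Desaturate: blend the frame with its luminance-like channel mean.
    gray = frame.mean(axis=2, keepdims=True)
    out = saturation * frame + (1.0 - saturation) * gray
    # Cheap box blur as a proxy for Gaussian blur; radius derived from sigma.
    r = int(round(blur_sigma))
    if r > 0:
        h, w, _ = out.shape
        pad = np.pad(out, ((r, r), (r, r), (0, 0)), mode="edge")
        acc = np.zeros_like(out)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                acc += pad[r + dy:r + dy + h, r + dx:r + dx + w]
        out = acc / (2 * r + 1) ** 2
    return out
```

In such a curriculum, `apply_diet` would be inserted into the video dataloader's transform chain, with `progress` driven by the current training step; temporal continuity (T) comes from sampling adjacent frames rather than from this per-frame transform.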