AI Summary
This study investigates the input statistics of object category representations in infants' early visual experience and their implications for visual learning. Drawing on more than three million frames of first-person video collected in the home environments of 31 children aged 5 to 36 months, the authors combine supervised object detection with high-dimensional embeddings from self-supervised visual and multimodal models. At scale and in naturalistic settings, they show that the distribution of object categories children observe is highly skewed, and that objects frequently appear from atypical viewpoints, in cluttered scenes, and under occlusion. Despite the sparsity and variability of individual categories, detected categories group more strongly within superordinate categories than do canonical photographs of the same categories. These findings provide empirical grounding for developing efficient and robust models of visual learning.
Abstract
Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children's visual experience at home from the BabyView dataset ($N$ = 31 participants, 868 hours, ages 5--36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children's object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children's visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.
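One way to make the notion of "stronger groupings within superordinate categories" concrete is to compare the average similarity of category embeddings that share a superordinate label with the average similarity of those that do not. The sketch below is only an illustration under stated assumptions: the abstract does not specify the grouping metric, so the within-minus-between cosine similarity used here is a stand-in, and the function names, the three-dimensional vectors, and the label assignments are all hypothetical toy values rather than data from the BabyView dataset.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def grouping_strength(embeddings, superordinate):
    """Mean within-superordinate similarity minus mean between-superordinate
    similarity, computed over all pairs of categories.

    Assumed metric for illustration only. Requires at least one
    within-group pair and one between-group pair.
    """
    within, between = [], []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if superordinate[a] == superordinate[b]:
            within.append(sim)
        else:
            between.append(sim)
    return sum(within) / len(within) - sum(between) / len(between)


# Hypothetical toy embeddings: one made-up vector per detected category.
toy_embeddings = {
    "giraffe": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "apple": [0.1, 0.9, 0.0],
    "banana": [0.0, 0.8, 0.2],
}
# Hypothetical superordinate labels for the toy categories.
toy_superordinate = {
    "giraffe": "animal",
    "dog": "animal",
    "apple": "food",
    "banana": "food",
}

strength = grouping_strength(toy_embeddings, toy_superordinate)
print(f"within-minus-between similarity: {strength:.3f}")
```

Under this assumed metric, a larger positive value indicates tighter superordinate grouping, so computing it once on embeddings of detected exemplars and once on embeddings of canonical photographs would give two numbers that can be compared directly.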