🤖 AI Summary
This study investigates whether an infant-inspired audiovisual learning paradigm can support zero-shot visual concept generalization, i.e., recognizing object categories that were never explicitly named in the model's training input.
Method: Leveraging longitudinal, egocentric images of a single child paired with transcribed parental speech, we construct a training-free framework, grounded in developmental principles, for discovering visual-concept-selective neurons. Using neuron activation pattern analysis, concept attribution, and cross-model representation comparison (with CLIP and ImageNet-pretrained models), we identify latent visual-concept neurons and perform zero-shot object classification (a minimal illustrative sketch follows this summary).
Contribution/Results: We provide the first empirical validation that an infant-inspired model can recognize novel object categories it has never heard named. The learned representations exhibit both category specificity and cross-categorical generalization—distinct from supervised pre-trained vision models. These findings offer novel computational evidence for embodied language acquisition and the emergence of generic visual concepts.
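To make the idea concrete, here is a minimal, illustrative sketch, not the paper's actual procedure, of how concept-selective neurons could be located from a frozen model's activations and then used for zero-shot labeling. The function names, the selectivity score, and the assumption that a small set of labeled probe activations (`acts`, `labels`) is available are all hypothetical choices made only for illustration.

```python
# Illustrative sketch only; assumes the backbone stays frozen (training-free)
# and that `acts` is an (n_images, n_neurons) array of non-negative activations
# (e.g. post-ReLU) for probe images with known concept labels `labels`.
import numpy as np

def find_selective_neurons(acts, labels, n_concepts, top_k=5):
    """For each concept, keep the neurons whose mean activation on that
    concept most exceeds their mean activation on all other images."""
    selective = {}
    for c in range(n_concepts):
        in_c = acts[labels == c].mean(axis=0)    # mean response on concept c
        out_c = acts[labels != c].mean(axis=0)   # mean response elsewhere
        selectivity = (in_c - out_c) / (in_c + out_c + 1e-8)
        selective[c] = np.argsort(selectivity)[-top_k:]  # top-k selective units
    return selective

def zero_shot_classify(act, selective):
    """Label a single activation vector `act` (n_neurons,) with the concept
    whose selective neurons respond most strongly, with no weight updates."""
    scores = {c: act[idx].mean() for c, idx in selective.items()}
    return max(scores, key=scores.get)
```

The point of the sketch is simply that, once selective units are identified, classification reduces to reading out their activations; the paper's attribution and selection criteria may differ.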
📝 Abstract
Infants develop complex visual understanding rapidly, even before acquiring linguistic input. As computer vision seeks to replicate the human visual system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? To investigate this, we analyze a model recently published in Science by Vong et al., which is trained on longitudinal, egocentric images of a single child paired with transcribed parental speech. We introduce a training-free framework that can discover visual concept neurons hidden in the model's internal representations. Our findings show that these neurons can classify objects outside the model's original vocabulary. Furthermore, we compare the visual representations of infant-like models with those of modern computer vision models, such as CLIP and ImageNet-pretrained models, highlighting key similarities and differences. Ultimately, our work bridges cognitive science and computer vision by analyzing the internal representations of a computational model trained on an infant's visual and linguistic inputs.
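The cross-model comparison mentioned above is, in spirit, a representational similarity analysis. As one hedged example, linear centered kernel alignment (CKA; Kornblith et al., 2019) is a standard way to score how similarly two models represent the same image set; the abstract does not state which metric the authors actually use, so the sketch below is only an assumed illustration.

```python
# Hedged example: linear CKA between two activation matrices extracted from
# different models (e.g. the infant-inspired model and CLIP) on the same images.
# X has shape (n_images, d1), Y has shape (n_images, d2); d1 and d2 may differ.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2   # cross-covariance energy
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)              # 1.0 = same geometry up to rotation/scale
```

A score near 1 means the two models organize the probe images similarly, while lower scores indicate diverging representational geometry; this is one way the "key similarities and differences" between the infant-like model and CLIP or ImageNet-pretrained models could be quantified.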