Do Blind Spots Matter for Word-Referent Mapping? A Computational Study with Infant Egocentric Video

📅 2025-11-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Infants begin learning their first words at around 6–9 months, and word-referent mapping poses a symbol-grounding challenge: without prior knowledge, a novel spoken word could refer to any object in view, its parts, or its attributes. Method: Using egocentric (first-person) video from one child, the paper proposes a biologically inspired masking strategy that models the localized occlusion caused by the physiological blind spot of the human eye, in place of conventional random masking. The mask is used to pretrain a masked autoencoder-based visual backbone in a self-supervised way, and the pretrained encoder is then used in a contrastive video-text model that learns word-referent mappings. Results: Blind-spot masking is reported to be at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes, while being easier to justify from a biological perspective.

📝 Abstract

Typically, children start to learn their first words between 6 and 9 months, linking spoken utterances to their visual referents. Without prior knowledge, a word encountered for the first time can be interpreted in countless ways; it might refer to any of the objects in the environment, their components, or attributes. Using longitudinal, egocentric, and ecologically valid data from the experience of one child, we propose a self-supervised and biologically plausible strategy to learn strong visual representations. Our masked autoencoder-based visual backbone incorporates knowledge about the blind spot in human eyes to define a novel masking strategy. This mask-and-reconstruct approach attempts to mimic the way the human brain fills the gaps in the eyes' field of view. This represents a significant shift from standard random masking strategies, which are difficult to justify from a biological perspective. The pretrained encoder is then used in a contrastive learning-based video-text model capable of acquiring word-referent mappings. Extensive evaluation suggests that the proposed biologically plausible masking strategy is at least as effective as random masking for learning word-referent mappings from cross-situational and temporally extended episodes.
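
To make the contrast between the two masking strategies concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the 14 x 14 patch grid, the elliptical shape, and the centre and radii of the masked region are all illustrative assumptions, and the helper names (`blind_spot_mask`, `random_mask`, `count_masked`) are hypothetical. The only idea taken from the abstract is that one localized region is hidden instead of patches scattered at random.

```python
import random


def blind_spot_mask(rows, cols, center, radii):
    """Return a rows x cols grid of booleans; True marks a masked patch.

    center: (row, col) of the assumed blind-spot centre, in patch units.
    radii:  (row_radius, col_radius) of the elliptical region, in patch units.
    Both are placeholders chosen for illustration, not values from the paper.
    """
    cr, cc = center
    rr, rc = radii
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # A patch is masked when its index falls inside the ellipse.
            inside = ((r - cr) / rr) ** 2 + ((c - cc) / rc) ** 2 <= 1.0
            row.append(inside)
        mask.append(row)
    return mask


def random_mask(rows, cols, n_masked, seed=0):
    """Baseline: mask n_masked patches chosen uniformly at random."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(rows * cols), n_masked))
    return [[(r * cols + c) in chosen for c in range(cols)] for r in range(rows)]


def count_masked(mask):
    """Number of masked patches in a grid."""
    return sum(sum(row) for row in mask)


# One contiguous region to the side of the image centre (assumed placement).
bs = blind_spot_mask(14, 14, center=(7.5, 10.0), radii=(2.0, 1.5))
# Random baseline with the same number of masked patches, for comparison.
rnd = random_mask(14, 14, n_masked=count_masked(bs))
```

Matching the number of masked patches between the two grids is a choice made here so that they differ only in spatial layout; the masking ratio used in the paper is not stated in this summary.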

Problem

Research questions and friction points this paper is trying to address.

Asks whether blind-spot-inspired masking affects word-referent mapping in a computational model of infant learning

Develops biologically plausible masking strategy mimicking human visual processing

Evaluates self-supervised approach for learning word meanings from egocentric video

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked autoencoder with blind spot masking strategy

Self-supervised contrastive learning for word-referent mapping

Biologically plausible visual representation from egocentric video
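
The contrastive video-text step listed above can be illustrated with a symmetric InfoNCE-style loss, a common choice for this kind of cross-modal alignment. Whether the paper uses exactly this loss is an assumption; the function name `info_nce`, the temperature of 0.07, and the toy two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for numerical stability
        lse = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += lse - row[i]
    return total / len(logits)


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over paired video and text embeddings.

    Pair i (video_embs[i], text_embs[i]) is the positive; every other
    combination in the batch serves as a negative.
    """
    v = [normalize(x) for x in video_embs]
    t = [normalize(x) for x in text_embs]
    logits = [[dot(vi, tj) / temperature for tj in t] for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the text side swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a setting like the one described, the video embeddings would come from the pretrained encoder and the text embeddings from transcribed utterances, so that minimizing the loss pulls a word toward the frames it co-occurs with.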

🔎 Similar Papers

Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker