🤖 AI Summary
Existing self-supervised learning models ignore the eccentricity-dependent nature of human foveated vision, in which resolution is highest at the center of the retina and falls off progressively toward the periphery, and therefore fail to emulate how infant visual experience drives the development of semantically grounded object representations. This work integrates cortical magnification modeling and gaze simulation into self-supervised learning, establishing a biologically inspired preprocessing pipeline for first-person video coupled with a temporal contrastive learning framework. Methodologically, it models the retinal resolution gradient explicitly via dynamic central cropping and multi-scale peripheral downsampling, encouraging joint representation learning of central objects and peripheral context. Experiments on real egocentric video show that the approach significantly improves the discriminability and generalizability of object representations, yielding an average +3.2% mAP gain on downstream detection and segmentation tasks. Crucially, it achieves a more balanced encoding of foveal and peripheral information, establishing a new paradigm for embodied visual representation learning.
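The central-resolution-plus-peripheral-downsampling idea can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function names (`foveate`, `box_downsample`), the single downsampling factor, and the linear radial blend between the full-resolution and low-pass versions are all assumptions made for clarity.

```python
import numpy as np

def box_downsample(img, factor):
    """Average-pool by an integer factor (a crude low-pass filter),
    then nearest-neighbour upsample back to the original grid.
    Assumes height and width are divisible by `factor`."""
    h, w = img.shape[:2]
    h2, w2 = h // factor, w // factor
    pooled = img.reshape(h2, factor, w2, factor, -1).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, factor, axis=0), factor, axis=1)

def foveate(img, gaze_yx, fovea_radius=32, factor=4):
    """Keep full resolution near the gaze point and blend in a
    downsampled version toward the periphery via a radial ramp.
    `img` is an (H, W, C) float array; `gaze_yx` is the gaze center."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(ys - gaze_yx[0], xs - gaze_yx[1])
    # Blend weight: 0 inside the fovea, ramping to 1 over one radius.
    alpha = np.clip((dist - fovea_radius) / fovea_radius, 0.0, 1.0)[..., None]
    low = box_downsample(img.astype(float), factor)
    return (1 - alpha) * img + alpha * low
```

A real pipeline would use a smoother eccentricity-dependent blur (or a log-polar cortical magnification warp) and gaze coordinates from an eye tracker or a gaze model, but the structure — sharp center, degraded periphery, gaze-dependent — is the same.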
📝 Abstract
Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. However, these models ignore the foveated nature of human vision, with high resolution in the center and low resolution in the periphery of the visual field. Here, we investigate the role of this varying resolution in the development of object representations. We leverage two datasets of egocentric videos that capture the visual experience of humans during interactions with objects. We apply models of human foveation and cortical magnification to modify these inputs, such that the visual content becomes less distinct towards the periphery. The resulting sequences are used to train two bio-inspired self-supervised learning models that implement a time-based learning objective. Our results show that modeling aspects of foveated vision improves the quality of the learned object representations in this setting. Our analysis suggests that this improvement comes from making objects appear bigger and inducing a better trade-off between central and peripheral visual information. Overall, this work takes a step towards making models of humans' learning of visual representations more realistic and performant.
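A time-based learning objective of the kind described — temporally adjacent frames as positives, other frames in the sequence as negatives — can be sketched as a simple InfoNCE loss. This is a minimal NumPy illustration under assumed choices (cosine similarity, a single temperature, successor-only positives); the actual objectives used by the two models in the paper may differ.

```python
import numpy as np

def time_contrastive_loss(z, temperature=0.1):
    """InfoNCE over a sequence of frame embeddings z (T, D):
    each frame's positive is its temporal successor; all other
    frames in the sequence serve as negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = (z @ z.T) / temperature                     # (T, T) logits
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    # Log-softmax over candidate frames for each anchor t.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    anchors = np.arange(len(z) - 1)
    return -logp[anchors, anchors + 1].mean()
```

Because natural egocentric video changes slowly, temporally adjacent frames tend to show the same object, so pulling their embeddings together yields object-level invariances without labels.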