Human Gaze Boosts Object-Centered Representation Learning

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of improving object-centricity and recognition performance in self-supervised visual representation learning by emulating human gaze mechanisms. Methodologically, it leverages the large-scale Egocentric 4D (Ego4D) video dataset and employs a gaze prediction model to localize fixation centers; high-resolution central regions are then dynamically cropped around these points, and temporal contrastive learning is applied to capture the spatiotemporal dynamics of gaze movement. Crucially, this is the first approach to incorporate biologically inspired foveal magnification—selective high-acuity amplification of attended regions—into an egocentric self-supervised learning framework. Experimental results demonstrate substantial improvements in object-centered representation quality, leading to measurable gains in downstream tasks such as image classification and significantly narrowing the performance gap between learned models and human visual recognition. The findings validate gaze guidance as a critical inductive bias for effective visual representation learning.
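The paper's released code is not shown here; the following is a minimal sketch of the gaze-centered cropping step described above, assuming a gaze prediction model that outputs normalized (x, y) fixation coordinates per video frame. The function name, crop fraction, and output size are illustrative placeholders, not values taken from the paper.

```python
# Minimal sketch of gaze-centered (foveal) cropping, assuming normalized gaze
# coordinates from an upstream gaze prediction model. Names and hyperparameters
# are illustrative assumptions, not the paper's released implementation.
import torch
import torchvision.transforms.functional as TF

def gaze_centered_crop(frame: torch.Tensor,
                       gaze_xy: tuple[float, float],
                       crop_frac: float = 0.5,
                       out_size: int = 224) -> torch.Tensor:
    """Crop a square region centered on the predicted gaze location.

    frame:     (C, H, W) image tensor.
    gaze_xy:   normalized gaze coordinates in [0, 1] (x across width, y across height).
    crop_frac: crop side length as a fraction of the shorter image side
               (an assumed hyperparameter standing in for the paper's crop size).
    """
    _, h, w = frame.shape
    side = int(crop_frac * min(h, w))
    # Convert normalized gaze to the crop's top-left corner in pixels,
    # clamping so the crop stays fully inside the frame.
    cx, cy = gaze_xy[0] * w, gaze_xy[1] * h
    top = int(max(0, min(cy - side / 2, h - side)))
    left = int(max(0, min(cx - side / 2, w - side)))
    # Crop the high-acuity central region and resize it to the model's input
    # resolution, emulating foveal magnification of the attended area.
    return TF.resized_crop(frame, top, left, side, side, [out_size, out_size])
```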

📝 Abstract
Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform humans on image recognition tasks. These models train on raw, uniform visual inputs collected from head-mounted cameras. Humans differ: the anatomical structure of the retina and visual cortex amplifies central visual information relative to the periphery, i.e., the area around the gaze location. This selective amplification in humans likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate 5 months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of the gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.
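The abstract describes a "time-based SSL model" trained on the gaze-centered crops. As a rough illustration of how such an objective can be set up, here is a sketch assuming a SimCLR-style InfoNCE loss in which positives are embeddings of gaze-centered crops from temporally nearby frames; the encoder, batch construction, and temperature are placeholder assumptions rather than the paper's exact formulation.

```python
# Sketch of a temporal contrastive (InfoNCE) objective, assuming positives are
# gaze-centered crops from nearby frames of the same video. The temperature and
# pairing scheme are assumptions for illustration only.
import torch
import torch.nn.functional as F

def temporal_info_nce(z_t: torch.Tensor, z_tk: torch.Tensor,
                      temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE loss between embeddings of frames at time t and t+k.

    z_t, z_tk: (B, D) embeddings of gaze-centered crops from frame t and a
               temporally nearby frame t+k; row i of each forms a positive pair.
    """
    z_t = F.normalize(z_t, dim=1)
    z_tk = F.normalize(z_tk, dim=1)
    logits = z_t @ z_tk.T / temperature              # (B, B) similarity logits
    targets = torch.arange(z_t.size(0), device=z_t.device)
    # Matching temporal neighbors are positives; all other pairs in the batch
    # act as negatives, pushing representations to stay stable over the short
    # timescale of a fixation while remaining discriminative across objects.
    return F.cross_entropy(logits, targets)
```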
Problem

Research questions and friction points this paper is trying to address.

Self-supervised Learning
Visual Feature Recognition
Human-level Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human Visual Attention
Self-supervised Learning
Biologically-inspired Learning