Temporal Slowness in Central Vision Drives Semantic Object Learning

📅 2026-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how semantic object representations can be learned without supervision from human-like egocentric visual streams. Using the Ego4D dataset, the authors generate fixation points with a state-of-the-art gaze prediction model, crop the video frames around those points to mimic central (foveated) vision, and train a time-contrastive self-supervised model on the crops to capture temporal slowness. The work combines central-vision modeling with temporal slowness learning and shows that the two mechanisms jointly support the encoding of different semantic facets of object representations in natural visual experience. According to the reported results, focusing on central vision strengthens the extraction of foreground object features, while temporal slowness, especially during fixational eye movements, lets the model encode broader semantic information about objects.
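The gaze-centered cropping step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `central_crop`, the square crop shape, the crop size, and the clamping behavior at frame borders are all assumptions made for illustration, and the gaze coordinates are assumed to arrive in pixel units from some external gaze predictor.

```python
import numpy as np


def central_crop(frame, gaze_xy, size):
    """Return a size x size crop of `frame` centered on the gaze point.

    Hypothetical helper: `frame` is an (H, W, C) array and `gaze_xy` is an
    (x, y) pixel coordinate. The crop window is shifted inward when the gaze
    lies near a border, so the output shape is always (size, size, C).
    """
    height, width = frame.shape[:2]
    if size > min(height, width):
        raise ValueError("crop size exceeds frame dimensions")
    x, y = gaze_xy
    half = size // 2
    # Clamp the top-left corner so the window stays inside the frame.
    left = int(np.clip(round(x) - half, 0, width - size))
    top = int(np.clip(round(y) - half, 0, height - size))
    return frame[top:top + size, left:left + size]


# Usage: a dummy 100 x 120 RGB frame and a gaze point near its center.
frame = np.zeros((100, 120, 3), dtype=np.uint8)
crop = central_crop(frame, gaze_xy=(60, 50), size=32)
```

Clamping rather than zero-padding is one possible design choice; a real foveation model might instead pad or blur the periphery.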

📝 Abstract

Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes only the center of its field of view at high resolution and learns similar representations for visual inputs that occur close in time. Together, these two properties emphasize slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive self-supervised learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.
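To make the idea of time-contrastive learning concrete, the sketch below shows one common formulation, an InfoNCE-style loss in which embeddings of temporally adjacent crops are treated as positive pairs and all other crops in the batch as negatives. The abstract does not specify the exact objective, so the loss form, the function name `time_contrastive_loss`, and the temperature value are assumptions rather than the authors' method.

```python
import numpy as np


def time_contrastive_loss(z_t, z_next, temperature=0.1):
    """InfoNCE-style loss over a batch of embedding pairs (illustrative only).

    z_t[i] and z_next[i] are assumed to be embeddings of two crops that are
    close in time; z_next[j] for j != i serves as a negative for z_t[i].
    """
    # L2-normalize so that dot products are cosine similarities.
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    z_next = z_next / np.linalg.norm(z_next, axis=1, keepdims=True)
    logits = (z_t @ z_next.T) / temperature
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: crop i paired with its temporal neighbor.
    return float(-np.mean(np.diag(log_prob)))


# Usage: temporally adjacent crops should embed similarly, giving a low loss.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = time_contrastive_loss(z, z + 0.01 * rng.normal(size=z.shape))
mismatched = time_contrastive_loss(z, z[::-1])
```

Minimizing such a loss pulls representations of nearby frames together, which is one way to operationalize the slowness principle the abstract refers to.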

Problem

Research questions and friction points this paper is trying to address.

semantic object learning

central vision

temporal slowness

egocentric vision

visual representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

central vision

temporal slowness

self-supervised learning