🤖 AI Summary
This work addresses the challenge of learning object representations from unlabeled, continuous first-person videos, where cluttered backgrounds, occlusions, and ego-motion severely hinder performance. To this end, the authors propose EgoViT, a novel framework that is the first to jointly model prototype-based object discovery and temporal stability. Built on the Vision Transformer architecture, EgoViT establishes an end-to-end self-supervised learning loop by integrating intra-frame distillation, depth regularization, and teacher–student temporal consistency constraints, enabling continual refinement of object representations without any manual annotations. On standard benchmarks, EgoViT achieves state-of-the-art results in unsupervised object discovery, improving CorLoc by 8.0% and semantic segmentation mIoU by 4.8% over existing methods.
📝 Abstract
Humans develop visual intelligence through perceiving and interacting with their environment - a self-supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first-person videos without relying on manual annotations. This setting poses the challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego-motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto-objects" through three synergistic mechanisms: (1) Proto-object Learning, which uses intra-frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher-Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle in which initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end-to-end on unlabeled first-person videos and is robust to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves a +8.0% CorLoc improvement in unsupervised object discovery and a +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.
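To make the third mechanism concrete, the sketch below illustrates the general shape of a teacher–student temporal consistency objective: an EMA teacher provides targets from frame *t*, and the student is encouraged to produce matching proto-object features at frame *t+1*. This is a minimal illustrative sketch of the generic technique (as used in self-distillation methods such as DINO), not EgoViT's actual implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.99):
    # Exponential-moving-average teacher update, common in self-distillation.
    # (Illustrative only; not EgoViT's actual update rule.)
    return momentum * teacher_w + (1.0 - momentum) * student_w

def cosine_consistency_loss(teacher_feat, student_feat):
    # 1 - cosine similarity between matched proto-object features,
    # averaged over objects. Lower means more temporally consistent.
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=-1, keepdims=True)
    s = student_feat / np.linalg.norm(student_feat, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(t * s, axis=-1)))

# Toy example: 4 hypothetical proto-objects with 16-dim features.
rng = np.random.default_rng(0)
feat_t = rng.normal(size=(4, 16))                    # teacher targets at frame t
feat_t1 = feat_t + 0.05 * rng.normal(size=(4, 16))   # student output at frame t+1
loss = cosine_consistency_loss(feat_t, feat_t1)      # small, since frames barely differ
```

In a real system the teacher would also gate which frame-to-frame matches contribute to the loss (the "teacher-filtered" part), e.g. by discarding low-confidence correspondences caused by occlusion or rapid ego-motion.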