🤖 AI Summary
This work investigates whether first-person video from a single individual can support general-purpose visual representation learning. Method: We propose "single-life" learning, a paradigm that trains a visual encoder via self-supervised contrastive learning on up to 30 hours of egocentric video captured from one subject over one week. To assess functional consistency across individuals, we introduce a cross-attention-based representation alignment metric and use it to show that independently trained single-life models converge to a highly aligned geometric understanding. Results: Models trained solely on single-life data learn geometric representations that transfer to downstream tasks such as depth estimation in unseen environments, and achieve performance comparable to baselines trained on an equal amount (30 hours) of diverse web data. Our findings indicate that continuous, individual perceptual experience encodes strong, structured priors, offering a promising pathway for visual representation learning in low-resource and privacy-sensitive settings.
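The summary above does not pin down the exact self-supervised objective. Below is a minimal sketch of one plausible instance, assuming a SimCLR-style InfoNCE loss in which temporally adjacent egocentric frames (different natural viewpoints of the same scene) form positive pairs. The encoder architecture, input size, and temperature are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: contrastive self-supervised learning on egocentric video,
# assuming temporally adjacent frames are treated as positive pairs.
import torch
import torch.nn.functional as F
from torch import nn

class FrameEncoder(nn.Module):
    """Toy convolutional encoder standing in for the paper's visual encoder."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalized embeddings so dot products are cosine similarities.
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE: frame i in z_a should match its temporal neighbor, frame i
    in z_b, against all other frames in the batch (assumed temperature tau)."""
    logits = z_a @ z_b.t() / tau            # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

encoder = FrameEncoder()
frames_t  = torch.randn(16, 3, 64, 64)     # frames sampled at time t
frames_t1 = torch.randn(16, 3, 64, 64)     # nearby frames at time t + delta
loss = info_nce(encoder(frames_t), encoder(frames_t1))
loss.backward()
```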
📝 Abstract
We introduce the "single-life" learning paradigm, in which we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that transfer effectively to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life yields performance comparable to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.
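The abstract describes, but does not define, the cross-attention-based alignment metric. The sketch below shows one plausible form such a probe could take, and is an assumption rather than the paper's method: patch features from model A attend over patch features from model B on the same image, and alignment is scored by how well the attention-weighted readout from B reconstructs A's features. The feature shapes, temperature, and cosine scoring are all hypothetical choices.

```python
# Hedged sketch of a cross-attention alignment probe between two
# independently trained encoders; shapes and scoring are assumptions.
import torch
import torch.nn.functional as F

def cross_attention_alignment(feats_a: torch.Tensor,
                              feats_b: torch.Tensor,
                              tau: float = 0.07) -> torch.Tensor:
    """feats_a, feats_b: (num_patches, dim) features from two encoders on
    the same image. Returns a scalar alignment score in [-1, 1]."""
    qa = F.normalize(feats_a, dim=-1)
    kb = F.normalize(feats_b, dim=-1)
    attn = F.softmax(qa @ kb.t() / tau, dim=-1)  # A-patches attend over B-patches
    readout = attn @ feats_b                     # B-based reconstruction of A
    return F.cosine_similarity(readout, feats_a, dim=-1).mean()

# Usage: high scores would indicate the two models organize patch features
# in functionally compatible ways, even though they were trained on
# different lives (random tensors here stand in for real features).
feats_model_a = torch.randn(196, 128)
feats_model_b = torch.randn(196, 128)
print(cross_attention_alignment(feats_model_a, feats_model_b))
```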