🤖 AI Summary
This work investigates whether first-person video from a single individual can support general-purpose visual representation learning. Method: We propose "single-life" learning, a paradigm that trains a visual encoder via self-supervised contrastive learning on up to 30 hours of egocentric video captured from one subject over one week. To assess functional consistency across individuals, we introduce a cross-attention-based representation alignment metric and use it to show that independently trained single-life models converge to a highly aligned geometric understanding. Results: Models trained solely on single-life data learn geometric representations that transfer to downstream tasks such as depth estimation in unseen environments, and achieve performance comparable to baselines trained on an equal amount (30 hours) of diverse web data. Our findings indicate that continuous, individual perceptual experience encodes strong, structured priors, offering a promising pathway for visual representation learning in low-resource and privacy-sensitive settings.
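The summary above does not pin down the exact self-supervised objective. Below is a minimal sketch of one plausible instance, assuming a SimCLR-style InfoNCE loss in which temporally adjacent egocentric frames (different natural viewpoints of the same scene) form positive pairs. The encoder architecture, input size, and temperature are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: contrastive self-supervised learning on egocentric video,
# assuming temporally adjacent frames are treated as positive pairs.
import torch
import torch.nn.functional as F
from torch import nn

class FrameEncoder(nn.Module):
    """Toy convolutional encoder standing in for the paper's visual encoder."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalized embeddings so dot products are cosine similarities.
        return F.normalize(self.proj(self.backbone(x)), dim=-1)

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE: frame i in z_a should match its temporal neighbor, frame i
    in z_b, against all other frames in the batch (assumed temperature tau)."""
    logits = z_a @ z_b.t() / tau            # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

encoder = FrameEncoder()
frames_t  = torch.randn(16, 3, 64, 64)     # frames sampled at time t
frames_t1 = torch.randn(16, 3, 64, 64)     # nearby frames at time t + delta
loss = info_nce(encoder(frames_t), encoder(frames_t1))
loss.backward()
```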
📝 Abstract
We introduce the "single-life" learning paradigm, in which we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that transfer effectively to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life yields performance comparable to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.
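The abstract describes, but does not define, the cross-attention-based alignment metric. The sketch below shows one plausible form such a probe could take, and is an assumption rather than the paper's method: patch features from model A attend over patch features from model B on the same image, and alignment is scored by how well the attention-weighted readout from B reconstructs A's features. The feature shapes, temperature, and cosine scoring are all hypothetical choices.

```python
# Hedged sketch of a cross-attention alignment probe between two
# independently trained encoders; shapes and scoring are assumptions.
import torch
import torch.nn.functional as F

def cross_attention_alignment(feats_a: torch.Tensor,
                              feats_b: torch.Tensor,
                              tau: float = 0.07) -> torch.Tensor:
    """feats_a, feats_b: (num_patches, dim) features from two encoders on
    the same image. Returns a scalar alignment score in [-1, 1]."""
    qa = F.normalize(feats_a, dim=-1)
    kb = F.normalize(feats_b, dim=-1)
    attn = F.softmax(qa @ kb.t() / tau, dim=-1)  # A-patches attend over B-patches
    readout = attn @ feats_b                     # B-based reconstruction of A
    return F.cosine_similarity(readout, feats_a, dim=-1).mean()

# Usage: high scores would indicate the two models organize patch features
# in functionally compatible ways, even though they were trained on
# different lives (random tensors here stand in for real features).
feats_model_a = torch.randn(196, 128)
feats_model_b = torch.randn(196, 128)
print(cross_attention_alignment(feats_model_a, feats_model_b))
```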