MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Existing egocentric datasets are limited in duration, making it difficult to model the long-horizon temporal dependencies required for complex robotic tasks. This work proposes a large-scale, long-duration egocentric data collection framework based on smartphones, leveraging the multimodal sensors and high-precision camera pose tracking of consumer-grade devices to capture continuous trajectories over sessions lasting an hour or more. The project releases an open-source mobile application and an end-to-end standardized processing pipeline, along with a new dataset comprising 200 hours of data from diverse real-world scenarios. By significantly lowering the barrier to data acquisition, this effort advances research in vision-language-action models and embodied foundation models while promoting data democratization in the field.

📝 Abstract

The recent advancement of Vision-Language-Action (VLA) models has driven a critical demand for large-scale egocentric datasets. However, existing datasets are often limited to short episodes, typically spanning only a few minutes, and so fail to capture the long-horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour-plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high-fidelity, long-term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are threefold: (1) we release a novel dataset comprising 200 hours of diverse, long-form egocentric data with persistent state tracking; (2) we open-source a mobile application that enables any user to record egocentric data; and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training-ready formats for VLA and foundation model research. By democratizing the data collection process, this work enables the massive-scale acquisition of long-horizon data across varied global environments, accelerating the development of generalizable robotic policies.

Problem

Research questions and friction points this paper is trying to address.

egocentric data

long horizon

Vision Language Action models

temporal dependencies

robotic task execution

Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric data

long-horizon learning

commodity hardware