🤖 AI Summary
This work addresses the scarcity of high-quality, large-scale real-world interaction data for embodied AI training, a challenge exacerbated by the high cost, strong hardware dependencies, and limited scalability of existing approaches. To overcome these limitations, we propose a lightweight, decentralized first-person data collection paradigm in which users wear an ergonomic smartphone mount and leverage a cross-platform mobile application to capture video anytime and anywhere. Our system integrates on-device real-time computation with a cloud-edge collaborative architecture that enables automatic annotation and filtering in the cloud, facilitating low-cost, scene-agnostic, and continuous data acquisition. The resulting large-scale real-world dataset substantially enhances model generalization on downstream tasks, demonstrating the efficiency, scalability, and practicality of our approach.
📝 Abstract
Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. Humans themselves are ideal physically embodied agents; therefore, obtaining egocentric real-world interaction data from globally distributed "human agents" offers the advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which simplifies hardware dependencies by leveraging humans and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile app that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed egocentric video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.
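The cloud-edge split described in the abstract — lightweight on-device processing followed by automated filtering in the cloud — can be sketched as a two-stage filter over captured clips. This is a minimal illustrative sketch only: the `Clip` fields, thresholds, and function names are assumptions for clarity, not the paper's actual pipeline or APIs.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """A captured egocentric video clip (fields are illustrative assumptions)."""
    clip_id: str
    duration_s: float   # recorded length in seconds
    sharpness: float    # 0..1, hypothetical on-device quality proxy

def edge_prefilter(clips, min_duration_s=2.0):
    # On-device step: drop clips too short to be useful before uploading,
    # saving bandwidth — the kind of real-time processing done in the app.
    return [c for c in clips if c.duration_s >= min_duration_s]

def cloud_filter(clips, min_sharpness=0.5):
    # Cloud step: automated quality filtering of uploaded clips, standing in
    # for the paper's labeling-and-filtering pipelines.
    return [c for c in clips if c.sharpness >= min_sharpness]

raw = [
    Clip("a", 1.0, 0.9),  # too short: dropped on device
    Clip("b", 5.0, 0.3),  # too blurry: dropped in the cloud
    Clip("c", 8.0, 0.8),  # kept as training data
]
kept = cloud_filter(edge_prefilter(raw))
print([c.clip_id for c in kept])  # -> ['c']
```

In a real deployment the on-device stage would run continuously during capture and the cloud stage would apply learned quality and annotation models rather than fixed thresholds.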