In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data

📅 2025-11-19
🤖 AI Summary
Problem: Most existing methods use egocentric human video only for simple pre-training, which leaves much of its potential for learning manipulation policies unused. Method: The paper splits human data into two categories, in-the-wild and on-task, and analyzes how to use each. It curates the PHSD dataset (over 1,000 hours of in-the-wild egocentric video plus over 20 hours of on-task video aligned to the target manipulation tasks) and uses it to train Human0, a large egocentric, language-conditioned flow matching policy. Domain adaptation techniques reduce the gap between human demonstrations and humanoid execution. Contribution/Results: The authors report that scaling human data gives Human0 three properties: following language instructions that appear only in human data, few-shot learning, and improved robustness when on-task data is used.

📝 Abstract
Egocentric videos are a valuable and scalable data source for learning manipulation policies. However, due to significant data heterogeneity, most existing approaches use human data only for simple pre-training, which does not unlock its full potential. This paper first provides a scalable recipe for collecting and using egocentric data by dividing human data into two categories, in-the-wild and on-task, along with a systematic analysis of how to use the data. We first curate a dataset, PHSD, which contains over 1,000 hours of diverse in-the-wild egocentric data and over 20 hours of on-task data directly aligned to the target manipulation tasks. This enables learning a large egocentric language-conditioned flow matching policy, Human0. With domain adaptation techniques, Human0 minimizes the gap between humans and humanoids. Empirically, we show that Human0 achieves several novel properties from scaling human data, including following language instructions seen only in human data, few-shot learning, and improved robustness using on-task data. Project website: https://xiongyicai.github.io/In-N-On/
Problem

Research questions and friction points this paper is trying to address.

Learning manipulation policies from heterogeneous egocentric video data
Bridging the domain gap between human demonstrations and robot execution
Scaling policy learning with in-the-wild and on-task human data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Categorizing human data into in-the-wild and on-task
Learning egocentric language-conditioned flow matching policy
Using domain adaptation to bridge human-humanoid gap
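
The summary names a language-conditioned flow matching policy but gives no details of it. As a rough illustration only, here is a minimal NumPy sketch of generic linear-path flow matching for action generation. It is not the Human0 implementation: the function names, the `predict_velocity` callable, the conditioning array `cond`, and the Euler sampler are all assumptions made for this example.

```python
import numpy as np


def flow_matching_pair(action, noise, t):
    """Linear interpolation path: return the point x_t and its target velocity.

    At t=0 the point is pure noise, at t=1 it is the demonstrated action.
    """
    x_t = (1.0 - t) * noise + t * action
    target_v = action - noise
    return x_t, target_v


def flow_matching_loss(predict_velocity, action, cond, rng):
    """Mean squared error between predicted and target velocities.

    predict_velocity(x_t, t, cond) is a hypothetical stand-in for a network
    conditioned on observation and language features (here just `cond`).
    """
    noise = rng.standard_normal(action.shape)
    t = rng.uniform(0.0, 1.0, size=(action.shape[0], 1))
    x_t, target_v = flow_matching_pair(action, noise, t)
    pred = predict_velocity(x_t, t, cond)
    return float(np.mean((pred - target_v) ** 2))


def sample_actions(predict_velocity, cond, action_dim, rng, steps=10):
    """Generate actions by Euler-integrating the velocity field from t=0 to t=1."""
    x = rng.standard_normal((cond.shape[0], action_dim))
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((cond.shape[0], 1), i * dt)
        x = x + dt * predict_velocity(x, t, cond)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy setup (assumption): the conditioning vector is itself the target action,
    # so the exact velocity field is known in closed form.
    cond = rng.standard_normal((4, 3))
    oracle = lambda x, t, c: (c - x) / (1.0 - t)
    print("loss:", flow_matching_loss(oracle, cond, cond, rng))
    print("max sampling error:", np.abs(sample_actions(oracle, cond, 3, rng) - cond).max())
```

In a real policy, `predict_velocity` would be a learned network and `cond` would carry egocentric image and language embeddings. The toy oracle here only checks that the training target and the sampler are consistent with each other.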
Authors

Xiongyi Cai, UC San Diego
Ri-Zhao Qiu, University of California San Diego (Robotics, Computer Vision)
Geng Chen, UC San Diego
Lai Wei, UC San Diego
Isabella Liu, University of California, San Diego (Computer Vision, Computer Graphics)
Tianshu Huang, UC San Diego
Xuxin Cheng, University of California, San Diego
Xiaolong Wang, UC San Diego