PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) are trained predominantly on third-person data and struggle to generalize to the egocentric, first-person perspective that humanoid robots operate from. Method: We propose the Egocentric2Embodiment (E2E) translation pipeline, which combines egocentric video parsing, schema-driven VQA generation, evidence-grounded constraints, and temporal modeling to automatically construct E2E-3M, a large-scale, temporally consistent, evidence-anchored, multi-level visual question answering (VQA) dataset, from massive human egocentric video corpora. Building on E2E-3M, we introduce PhysBrain, the first egocentric-aware embodied brain architecture for physical intelligence, which supports both embodied VLM pretraining and vision-language-action (VLA) fine-tuning. Results: PhysBrain achieves significant gains in reasoning and comprehension on the EgoThink planning benchmark, improves sample efficiency during VLA fine-tuning, and attains a 53.9% success rate on SimplerEnv tasks, establishing a scalable first-person supervision paradigm for embodied intelligence.

📝 Abstract
Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
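
The abstract names the key properties of E2E-3M supervision (multi-level questions, enforced evidence grounding, temporal consistency) but not a concrete sample format. The sketch below is a minimal illustration of what such a sample could look like; all class names, field names, and question levels are assumptions made for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical question levels mirroring the "multi-level" supervision described
# in the abstract (state changes, contact-rich interactions, long-horizon planning).
QUESTION_LEVELS = ("state_change", "interaction", "planning")

@dataclass
class EvidenceAnchor:
    """A span of the source egocentric clip that grounds an answer (assumed layout)."""
    clip_id: str
    start_s: float       # evidence start time, in seconds
    end_s: float         # evidence end time, in seconds
    description: str     # what in these frames supports the answer

@dataclass
class VQASample:
    """One schema-driven, evidence-anchored VQA item (assumed layout)."""
    level: str                                   # one of QUESTION_LEVELS
    question: str
    answer: str
    evidence: List[EvidenceAnchor] = field(default_factory=list)

    def is_well_formed(self) -> Tuple[bool, str]:
        """Check the two constraints the abstract enforces on each item:
        evidence grounding and temporal consistency of the cited spans."""
        if self.level not in QUESTION_LEVELS:
            return False, "unknown question level"
        if not self.evidence:
            return False, "answer cites no evidence"
        spans = [(a.start_s, a.end_s) for a in self.evidence]
        if any(s > e for s, e in spans):
            return False, "evidence span ends before it starts"
        if spans != sorted(spans):
            return False, "evidence spans are not in temporal order"
        return True, "ok"
```

The check above only validates ordering within a single item; a full pipeline would also need cross-item consistency, for example that a question about a later step never cites evidence from before the state it depends on.
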
Problem

Research questions and friction points this paper is trying to address.

Bridge viewpoint mismatch between VLMs and humanoid robots
Convert human egocentric videos into structured training supervision
Enable sample-efficient robot control transfer from human data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging human egocentric videos for scalable robot training data
Converting first-person videos into structured VQA supervision with grounding (see the flow sketch after this list)
Training an egocentric-aware model for improved planning and control transfer
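
As a rough illustration of the conversion bullet above, here is a hedged sketch of how such a translation loop could be organized. The function and argument names (`translate_clip`, `parse_clip`, `generate_qa`) are hypothetical; the paper only names the stages (parsing, schema-driven generation, evidence grounding, temporal checks), not their interfaces.

```python
from typing import Callable, Dict, Iterable, List

# Candidate QA items are assumed to be plain dicts with a 'question', an
# 'answer', and an 'evidence' list of (start_s, end_s) spans in seconds.
QAItem = Dict[str, object]

def translate_clip(
    clip_path: str,
    schemas: Iterable[Dict[str, object]],
    parse_clip: Callable[[str], List[Dict[str, object]]],
    generate_qa: Callable[[List[Dict[str, object]], Dict[str, object]], List[QAItem]],
) -> List[QAItem]:
    """Hypothetical Egocentric2Embodiment-style translation of one clip.

    The caller supplies `parse_clip` (video -> timestamped events such as
    hand-object contacts) and `generate_qa` (events + schema -> candidate
    QA items); both are assumptions standing in for components the paper
    names but does not specify.
    """
    events = parse_clip(clip_path)              # 1. egocentric video parsing
    kept: List[QAItem] = []
    for schema in schemas:                      # 2. schema-driven VQA generation
        for qa in generate_qa(events, schema):
            spans = qa.get("evidence", [])
            if not spans:                       # 3. enforce evidence grounding
                continue
            if any(s > e for s, e in spans):    #    reject malformed spans
                continue
            if spans != sorted(spans):          # 4. enforce temporal consistency
                continue
            kept.append(qa)
    return kept
```

Run over a large egocentric corpus, one call per clip and one pass per schema level, a loop of this shape is what would yield a dataset on the scale of E2E-3M; the filters in steps 3 and 4 are where ungrounded or inconsistent machine-generated QA would be discarded rather than becoming noisy supervision.
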
Authors

Xiaopeng Lin
The Hong Kong University of Science and Technology (Guangzhou)
Shijie Lian
Huazhong University of Science and Technology
Bin Yu
Harbin Institute of Technology
Ruoqi Yang
Zhongguancun Institute of Artificial Intelligence
Changti Wu
Zhongguancun Academy
Yuzhuo Miao
Harbin Institute of Technology
Yurun Jin
Zhongguancun Institute of Artificial Intelligence
Yukun Shi
Zhongguancun Institute of Artificial Intelligence
Cong Huang
University of Science and Technology of China
Bojun Cheng
The Hong Kong University of Science and Technology (Guangzhou)
Kai Chen
Zhongguancun Academy, Zhongguancun Institute of Artificial Intelligence, DeepCybo