🤖 AI Summary
Robot trajectory datasets alone offer limited coverage for learning broad physical commonsense, and effective methods for extracting structured physical knowledge from everyday human interaction are lacking. This work presents a systematic approach to converting large-scale egocentric (first-person) human video into supervision for physical commonsense question answering. A data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, converts them into question-answer pairs, and uses these pairs to train vision-language models (PhysBrain VLMs). The learned physical priors are then transferred to vision-language-action policies through an adaptation design that preserves existing capabilities while remaining language-sensitive. The approach reports state-of-the-art results across multimodal QA and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, with especially strong out-of-domain performance on SimplerEnv.
📝 Abstract
Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
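To make the data-engine step more concrete, the following is a minimal illustrative sketch of how a structured annotation of one egocentric clip could be turned into question-answer supervision. It is not the PhysBrain implementation: the `ClipAnnotation` record, its field names, the `to_qa_pairs` function, and the question templates are all hypothetical and chosen only to mirror the four categories named in the abstract (scene elements, spatial dynamics, action execution, depth-aware relations).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ClipAnnotation:
    """Hypothetical structured record for one egocentric video clip."""
    objects: list[str]   # scene elements
    action: str          # action execution, e.g. "pick up"
    target: str          # object the action is applied to
    motion: str          # spatial dynamics, e.g. "moves upward"
    depth_order: list[str] = field(default_factory=list)  # nearest object first


def to_qa_pairs(ann: ClipAnnotation) -> list[tuple[str, str]]:
    """Turn one annotation into (question, answer) supervision pairs.

    The templates are assumptions for illustration, not the paper's prompts.
    """
    qa = [
        ("Which objects are visible in the scene?", ", ".join(ann.objects)),
        ("What action does the person perform?", f"{ann.action} the {ann.target}"),
        (f"How does the {ann.target} move?", ann.motion),
    ]
    # Depth-aware relation: only emitted when at least two objects are ordered.
    if len(ann.depth_order) >= 2:
        near, far = ann.depth_order[0], ann.depth_order[-1]
        qa.append(
            (f"Which is closer to the camera, the {near} or the {far}?", near)
        )
    return qa


if __name__ == "__main__":
    example = ClipAnnotation(
        objects=["cup", "kettle"],
        action="pick up",
        target="cup",
        motion="moves upward",
        depth_order=["cup", "kettle"],
    )
    for question, answer in to_qa_pairs(example):
        print(f"Q: {question}\nA: {answer}")
```

Under these assumptions, each clip yields one question per annotation category, and the depth question is skipped when no depth ordering is available, so incomplete annotations still produce usable pairs.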