🤖 AI Summary
This work addresses a key challenge in embodied intelligence: extracting transferable manipulation-intention priors from large-scale human demonstration videos. The authors propose the MoT-HRA framework and build the HA-2.2M dataset (2.2M episodes) through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. The framework factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer let downstream control use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show improved motion plausibility and more robust control under distribution shift.
📝 Abstract
Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.
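The read-only key-value transfer can be pictured as a one-way attention mask over the three experts' tokens. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes tokens are laid out expert by expert (vision-language, intention, fine), that a single set of projection weights is shared for simplicity, and that "read-only" means each expert may attend to keys and values from itself and from upstream experts but never from downstream ones. The function names `expert_block_mask` and `shared_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def expert_block_mask(sizes):
    """Return mask[i, j] = True if query token i may read key/value token j.

    Assumption: tokens are ordered expert by expert, and expert k may read
    experts 0..k only (upstream and itself), never downstream experts.
    """
    expert_id = np.repeat(np.arange(len(sizes)), sizes)
    return expert_id[:, None] >= expert_id[None, :]


def shared_attention(x, sizes, wq, wk, wv):
    """Single-head attention over all expert tokens with the one-way mask.

    Assumption: one shared set of projections; a real model may well use
    per-expert weights and multiple heads.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(expert_block_mask(sizes), scores, -np.inf)
    return softmax(scores) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sizes = [4, 3, 2]  # hypothetical token counts: vision-language, intention, fine
    dim = 8
    x = rng.normal(size=(sum(sizes), dim))
    wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
    out = shared_attention(x, sizes, wq, wk, wv)
    print(out.shape)
```

Under this mask, perturbing the fine expert's tokens leaves the outputs of the two upstream experts unchanged, which is one way to read "limiting interference with upstream representations". A trained system would likely also need to block gradients flowing back through the transferred keys and values; that part is not shown here.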