Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a central challenge in embodied intelligence: extracting transferable manipulation-intention priors from large-scale human demonstration videos. The authors propose MoT-HRA, a hierarchical vision-language-action framework, and curate the HA-2.2M dataset (2.2M episodes) through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. The model factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer let downstream control use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show improved motion plausibility and more robust control under distribution shift.

📝 Abstract

Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and robust control under distribution shift.
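
The abstract does not spell out how read-only key-value transfer is implemented. One plausible reading is a block-wise attention mask over the shared trunk: each expert's tokens may read the keys and values of its own level and of upstream experts, while upstream tokens never attend to downstream ones. The sketch below illustrates only that masking pattern. The expert names, their ordering, the token counts, and the function `build_read_only_mask` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical expert order, upstream to downstream (illustrative names only).
EXPERT_ORDER = ["vision_language", "intention", "fine"]


def build_read_only_mask(tokens_per_expert: Dict[str, int]) -> List[List[bool]]:
    """Return mask[q][k] == True where query token q may attend to key token k.

    Assumption: a token may read keys/values from its own expert and from any
    upstream expert, but never from a downstream expert.
    """
    # owner[i] is the level (0 = most upstream) of the expert that owns token i.
    owner: List[int] = []
    for level, name in enumerate(EXPERT_ORDER):
        owner.extend([level] * tokens_per_expert[name])
    n = len(owner)
    return [[owner[k] <= owner[q] for k in range(n)] for q in range(n)]


# Example: 2 vision-language tokens, 1 intention token, 1 fine (action) token.
mask = build_read_only_mask({"vision_language": 2, "intention": 1, "fine": 1})
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption the fine expert's row is fully populated while the vision-language rows are blank in the downstream columns, so downstream control can condition on upstream features without changing what the upstream experts compute in the forward pass. A trained system would presumably also need to block gradients through the transferred keys and values, which this mask alone does not do.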

Problem

Research questions and friction points this paper is trying to address.

human-intention priors

robotic manipulation

human demonstrations

vision-language-action

embodiment

Innovation

Methods, ideas, or system contributions that make the work stand out.

human-intention priors

hierarchical vision-language-action framework

embodiment-agnostic trajectory