FMimic: Foundation Models are Fine-grained Action Learners from Human Videos

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-based imitation learning methods rely on predefined motion primitives for physical interaction, which limits generalization and precision. This paper introduces FMimic, an end-to-end framework that directly leverages foundation models, specifically vision-language models (VLMs), for fine-grained action skill learning, requiring only a small number of human demonstration videos and eliminating dependence on handcrafted action primitives. Its core idea is to transfer VLMs' cross-modal semantic understanding to low-level action modeling, enabling both long-horizon planning and high-precision manipulation. FMimic improves success rates by over 39% on the RLBench multi-task benchmark and by over 29% in real-robot experiments, and exceeds state-of-the-art baselines by more than 34% on high-precision tasks and more than 47% on long-horizon tasks.

📝 Abstract
Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advancements in foundation models, particularly Vision Language Models (VLMs), have demonstrated remarkable capabilities in visual and linguistic reasoning for VIL tasks. Despite this progress, existing approaches primarily utilize these models for learning high-level plans from human demonstrations, relying on pre-defined motion primitives for executing physical interactions, which remains a major bottleneck for robotic systems. In this work, we present FMimic, a novel paradigm that harnesses foundation models to directly learn generalizable skills at even fine-grained action levels, using only a limited number of human videos. Extensive experiments demonstrate that our FMimic delivers strong performance with a single human video, and significantly outperforms all other methods with five videos. Furthermore, our method exhibits significant improvements of over 39% and 29% in RLBench multi-task experiments and real-world manipulation tasks, respectively, and exceeds baselines by more than 34% in high-precision tasks and 47% in long-horizon tasks.
Problem

Research questions and friction points this paper is trying to address.

Enables robots to learn fine-grained actions from human videos
Overcomes reliance on pre-defined motion primitives in robotics
Improves performance in multi-task and real-world manipulation scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses foundation models for fine-grained action learning
Learns generalizable skills from few human videos
Improves performance in precision and long-horizon tasks
👥 Authors
Guangyan Chen (Beijing Institute of Technology)
Meiling Wang (Beijing Institute of Technology, Beijing, P. R. China)
Te Cui (Beijing Institute of Technology)
Yao Mu (The University of Hong Kong, Hong Kong, P. R. China)
Haoyang Lu (Max Planck UCL Centre for Computational Psychiatry and Ageing Research)
Zicai Peng (Beijing Institute of Technology, Beijing, P. R. China)
Mengxiao Hu (Beijing Institute of Technology, Beijing, P. R. China)
Tianxing Zhou (Beijing Institute of Technology, Beijing, P. R. China)
Mengyin Fu (Beijing Institute of Technology, Beijing, P. R. China)
Yi Yang (Beijing Institute of Technology, Beijing, P. R. China)
Yufeng Yue (Beijing Institute of Technology, Beijing, P. R. China)