🤖 AI Summary
Robots struggle to transfer tool-manipulation skills from a single human demonstration to morphologically distinct but functionally equivalent tools, primarily because functionally similar tools exhibit significant geometric variations (termed *intra-function variations*) that hinder function-level alignment. To address this, the work proposes the *function frame*, a function-centric local coordinate frame built from keypoint-based abstraction that explicitly encodes tool functionality and motion semantics, enabling one-shot generalization from a single RGB-D human demonstration video to novel tools. Leveraging this one-shot generalization, the framework's generated rollouts can be used to train visuomotor policies without labor-intensive teleoperation data collection. Experiments demonstrate successful cross-tool skill transfer across multiple functionally equivalent tasks, substantially improving the functional robustness and generalization capability of imitation learning.
📝 Abstract
Imitating tool manipulation from human videos offers an intuitive approach to teaching robots, while also providing a promising and scalable alternative to labor-intensive teleoperation data collection for visuomotor policy learning. While humans can mimic tool manipulation behavior by observing others perform a task just once and effortlessly transfer the skill to diverse tools for functionally equivalent tasks, current robots struggle to achieve this level of generalization. A key challenge lies in establishing function-level correspondences, given the significant geometric variations among functionally similar tools, referred to as intra-function variations. To address this challenge, we propose MimicFunc, a framework that establishes functional correspondences with the function frame, a function-centric local coordinate frame constructed with keypoint-based abstraction, for imitating tool manipulation skills. Experiments demonstrate that MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools for functionally equivalent tasks. Furthermore, leveraging MimicFunc's one-shot generalization capability, the generated rollouts can be used to train visuomotor policies without requiring labor-intensive teleoperation data collection for novel objects. Our code and video are available at https://sites.google.com/view/mimicfunc.
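The abstract describes the function frame only at a high level. As a purely illustrative sketch (not the paper's implementation), the core idea of a function-centric local coordinate frame can be made concrete with basic geometry: anchor a frame at a functional keypoint of the tool, orient it using two more keypoints, and re-express the demonstrated end-effector pose in that frame so it can be mapped onto a novel tool's frame. The keypoint choices and function names below (`keypoint_frame`, `transfer_pose`) are hypothetical.

```python
import numpy as np

def _normalize(v):
    return v / np.linalg.norm(v)

def keypoint_frame(origin, x_hint, plane_hint):
    """Build a 4x4 rigid transform from three keypoints (hypothetical scheme):
    origin sits at a functional keypoint (e.g. a hammer head), x_hint sets the
    x-axis direction, and plane_hint pins down the keypoint plane."""
    x = _normalize(x_hint - origin)                    # x-axis toward functional point
    z = _normalize(np.cross(x, plane_hint - origin))   # normal of the keypoint plane
    y = np.cross(z, x)                                 # completes a right-handed frame
    T = np.eye(4)
    T[:3, 0], T[:3, 1], T[:3, 2], T[:3, 3] = x, y, z, origin
    return T

def transfer_pose(pose_world, T_demo, T_novel):
    """Express a demonstrated world-frame pose relative to the demo tool's
    function frame, then re-anchor it in the novel tool's function frame."""
    return T_novel @ np.linalg.inv(T_demo) @ pose_world
```

Because both tools are reduced to the same keypoint abstraction, the transferred pose is invariant to the geometric differences between them; only the keypoints have to be detected consistently across tools.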