AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual robotic manipulation (VRM) suffers from scarce robot interaction data and high costs of multimodal annotation. Existing vision-language pretraining approaches either rely on non-task-specific web data or employ implicit modeling (e.g., frame prediction), resulting in poor generalization under few-shot settings. To address this, we propose an analogy-based cross-modal action transfer framework. Our method explicitly extracts action knowledge from human hand keypoints—establishing a structured analogical mapping between human motion and robot actuator dynamics for the first time. It integrates keypoint-driven vision-language pretraining, human action video retrieval, historical observation alignment, and an analogy reasoning network. Evaluated on the CALVIN benchmark and real-robot experiments, our approach significantly outperforms state-of-the-art methods in few-shot scenarios, demonstrating that human motion priors effectively enhance robotic generalization across tasks and environments.

📝 Abstract
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action keypoints instead of irrelevant visual cues, our method achieves leading performance on the CALVIN benchmark and real-world experiments. In few-shot scenarios, our AR-VRM outperforms previous methods by large margins, underscoring the effectiveness of explicitly imitating human actions under data scarcity.
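The retrieval-then-mapping pipeline described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the embedding dimensions, the cosine-similarity retrieval, and the linear keypoint-to-action map (with random stand-in weights) are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a bank of stored human action videos, each with
# an embedding (instruction + historical observations) and 21 2-D hand
# keypoints. Dimensions are illustrative, not from the paper.
human_video_embs = rng.normal(size=(5, 8))           # 5 stored videos
human_hand_keypoints = rng.normal(size=(5, 21, 2))   # 21 keypoints each

# Current robot episode, embedded into the same space (here simulated as
# a perturbed copy of one stored video's embedding).
robot_query_emb = human_video_embs[2] + 0.01 * rng.normal(size=8)

def retrieve_nearest(query, bank):
    """Cosine-similarity retrieval of the most similar human video."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return int(np.argmax(b @ q))

idx = retrieve_nearest(robot_query_emb, human_video_embs)

# Analogical mapping: a learned map from human hand keypoints to robot
# components; sketched here as a linear projection of the flattened
# keypoints onto a 7-DoF action, with random weights standing in for
# trained parameters.
W = rng.normal(size=(7, 21 * 2))
robot_action = W @ human_hand_keypoints[idx].reshape(-1)
```

In the paper this mapping is learned jointly with the policy during fine-tuning on robot data; the sketch only conveys the data flow: embed, retrieve an analogous human demonstration, then translate its hand motion into the robot's action space.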
Problem

Research questions and friction points this paper is trying to address.

Addressing data scarcity in Visual Robot Manipulation (VRM) tasks
Improving generalization by imitating human actions explicitly
Enhancing few-shot performance via analogical reasoning from human videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Keypoint VLM pretraining for human action knowledge
Analogical Reasoning map between human and robot actions
Explicit human action imitation for robot manipulation
Dejie Yang
Peking University
VLMRobot
Zijing Zhao
Lenovo Research
Yang Liu
Wangxuan Institute of Computer Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University