Robot Learning from Human Videos: A Survey

📅 2026-04-30

🤖 AI Summary
This work addresses the challenge of data scarcity in robot learning by providing a systematic survey of recent advances in transferring manipulation skills from human videos. It introduces the first hierarchical taxonomy tailored to robotic skill acquisition, integrating human-to-robot transfer pathways, data configurations, and learning paradigms across three levels: task, observation, and action. The study further presents a large-scale statistical analysis of existing video datasets, characterizing their scale, structure, and usage trends. By synthesizing developments in policy learning, computer vision, generative modeling, and cross-paradigm coupling methods, this survey comprehensively maps the current landscape, identifies key challenges and limitations, and outlines promising future directions. To foster community progress, the authors also release an open-source collection of relevant papers.
📝 Abstract
A critical bottleneck hindering further advancement in embodied AI and robotics is the challenge of scaling robot data. To address this, the field of learning robot manipulation skills from human video data has attracted rapidly growing attention in recent years, driven by the abundance of human activity videos and advances in computer vision. This line of research promises to enable robots to acquire skills passively from the vast and readily available resource of human demonstrations, substantially favoring scalable learning for generalist robotic systems. Therefore, we present this survey to provide a comprehensive and up-to-date review of human-video-based learning techniques in robotics, focusing on both human-robot skill transfer and data foundations. We first review the policy learning foundations in robotics, and then describe the fundamental interfaces to incorporate human videos. Subsequently, we introduce a hierarchical taxonomy of transferring human videos to robot skills, covering task-, observation-, and action-oriented pathways, along with a cross-family analysis of their couplings with different data configurations and learning paradigms. In addition, we investigate the data foundations including widely-used human video datasets and video generation schemes, and provide large-scale statistical trends in dataset development and utilization. Ultimately, we emphasize the challenges and limitations intrinsic to this field, and delineate potential avenues for future research. The paper list of our survey is available at https://github.com/IRMVLab/awesome-robot-learning-from-human-videos.
Problem

Research questions and friction points this paper is trying to address.

robot learning
human videos
skill transfer
data scaling
embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

human-video-based learning
skill transfer
hierarchical taxonomy
data foundations
embodied AI
Junyi Ma
Ph.D. Candidate, Shanghai Jiao Tong University (SJTU)
Robotics, Intelligent Vehicles
Erhang Zhang
Shanghai Jiao Tong University, China
Haoran Yang
Central South University
Graph Neural Networks, Data Mining, Recommendation Systems
Ditao Li
Shanghai Jiao Tong University, China
Chenyang Xu
East China Normal University
Theoretical Computer Science, Operations Research
Guangming Wang
University of Cambridge, ETH Zurich, and Shanghai Jiao Tong University
Robot Vision, Robot Manipulation, Robotics, Computer Vision, Autonomous Driving
Hesheng Wang
Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China and the Key Laboratory of System Control and Information Processing, Ministry of Education of China