🤖 AI Summary
In single-video imitation learning for robotic motor skill acquisition, inefficient frame sampling and suboptimal reward design lead to training redundancy and high computational overhead. To address this, we propose a motion-aware frame selection mechanism and a hybrid three-phase training framework. Our approach eliminates handcrafted reward functions by jointly leveraging vision-language models (VLMs) and motion-saliency modeling to enable adaptive keyframe identification. It further integrates phased reinforcement learning with online policy fine-tuning to improve both training efficiency and policy generalizability. Experiments in simulation and on real robotic platforms demonstrate that our method faithfully reproduces complex locomotion skills, such as dynamic gaits, with significantly reduced computational cost: training speed improves by up to 2.3× over baseline methods. This work establishes a new paradigm for data-efficient, low-overhead embodied skill learning.
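The motion-saliency idea behind the keyframe selection can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a simple frame-differencing saliency score (mean absolute pixel change between consecutive frames) and a hypothetical `select_keyframes` helper that keeps the top-k most salient frames in temporal order:

```python
import numpy as np

def select_keyframes(frames, k=8):
    """Score each frame by motion saliency (mean absolute pixel
    difference from its predecessor) and keep the top-k, in order."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    # Frame 0 has no predecessor; give it zero saliency.
    saliency = np.concatenate([[0.0], diffs])
    top = np.sort(np.argsort(saliency)[-k:])  # restore temporal order
    return top, saliency

# Toy demo: 20 near-static "frames" with a high-motion burst at frames 10-13.
rng = np.random.default_rng(0)
video = rng.random((20, 8, 8, 3)) * 0.01
video[10:14] += rng.random((4, 8, 8, 3))
idx, _ = select_keyframes(video, k=4)
```

In this toy setup the selected indices cluster around the high-motion segment, which is the behavior the summary describes: redundant static frames are skipped so the VLM sees only motion-informative keyframes.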
📝 Abstract
Vision-language models (VLMs) have demonstrated excellent high-level planning capabilities, enabling locomotion skill learning from video demonstrations without the need for meticulous human-designed rewards. However, improper frame sampling and low training efficiency remain critical bottlenecks in current methods, resulting in substantial computational overhead and time costs. To address these limitations, we propose Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos (MA-ROESL). MA-ROESL integrates a motion-aware frame selection method to implicitly enhance the quality of VLM-generated reward functions. It further employs a hybrid three-phase training pipeline that improves training efficiency via rapid reward optimization and derives the final policy through online fine-tuning. Experimental results demonstrate that MA-ROESL significantly enhances training efficiency while faithfully reproducing locomotion skills in both simulated and real-world settings, underscoring its potential as a robust and scalable framework for efficient robot locomotion skill learning from video demonstrations.