AMPLIFY: Actionless Motion Priors for Robot Learning from Videos

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Robot learning is bottlenecked by scarce action-labeled data and poor generalization. AMPLIFY addresses this with an actionless motion prior that leverages large-scale unlabeled video. By modeling keypoint trajectories and discretizing motion into compact tokens, it decouples visual motion prediction from action inference, splitting the system into a forward dynamics model trained on abundant action-free video and an inverse dynamics model trained on a limited set of action-labeled examples, so the two can scale independently. This modularity enables the first generalization to LIBERO tasks with zero in-distribution action data, up to 3.7x lower motion prediction MSE, and over 2.5x better pixel-level prediction accuracy than prior approaches. Learning from human demonstration videos without action annotations boosts average policy performance by 1.4x, and in low-action-data regimes gains range from 1.2x to 2.2x. The learned dynamics also serve as a versatile latent world model that improves video prediction quality.

📝 Abstract
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it. We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples, allowing for independent scaling. Extensive evaluations demonstrate that the learned dynamics are both accurate, achieving up to 3.7x better MSE and over 2.5x better pixel prediction accuracy compared to prior approaches, and broadly useful. In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data. Beyond robotic control, we find the dynamics learned by AMPLIFY to be a versatile latent world model, enhancing video prediction quality. Our results present a novel paradigm leveraging heterogeneous data sources to build efficient, generalizable world models. More information can be found at https://amplify-robotics.github.io/.
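The decomposition described in the abstract can be made concrete with a minimal sketch. All module names (MotionTokenizer, ForwardDynamics, InverseDynamics), dimensions, and the nearest-codebook quantization below are illustrative assumptions, not the released AMPLIFY implementation; the point is only how a forward model trained on action-free video pairs with a small inverse model trained on action-labeled data.

```python
# Minimal sketch of the modular decomposition described in the abstract.
# Module names, dimensions, and the quantization scheme are illustrative
# assumptions, not the released AMPLIFY code.
import torch
import torch.nn as nn


class MotionTokenizer(nn.Module):
    """Discretizes keypoint trajectories into a small vocabulary of motion tokens."""

    def __init__(self, traj_dim: int = 32, num_tokens: int = 512, dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(traj_dim, dim)
        self.codebook = nn.Embedding(num_tokens, dim)

    def forward(self, keypoint_traj: torch.Tensor) -> torch.Tensor:
        z = self.encoder(keypoint_traj)                                  # (B, T, dim)
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, num_tokens)
        return dists.argmin(dim=-1)                                      # discrete token ids


class ForwardDynamics(nn.Module):
    """Predicts future motion tokens from observation features.
    Trainable on action-free video alone."""

    def __init__(self, obs_dim: int = 256, num_tokens: int = 512, horizon: int = 8):
        super().__init__()
        self.head = nn.Linear(obs_dim, horizon * num_tokens)
        self.horizon, self.num_tokens = horizon, num_tokens

    def forward(self, obs_features: torch.Tensor) -> torch.Tensor:
        logits = self.head(obs_features)                        # (B, horizon * num_tokens)
        return logits.view(-1, self.horizon, self.num_tokens)   # per-step token logits


class InverseDynamics(nn.Module):
    """Maps predicted motion tokens plus the current observation to robot actions.
    Needs only a limited set of action-labeled examples."""

    def __init__(self, obs_dim: int = 256, num_tokens: int = 512, dim: int = 128,
                 action_dim: int = 7):
        super().__init__()
        self.token_emb = nn.Embedding(num_tokens, dim)
        self.head = nn.Linear(obs_dim + dim, action_dim)

    def forward(self, obs_features: torch.Tensor, motion_tokens: torch.Tensor) -> torch.Tensor:
        tok = self.token_emb(motion_tokens).mean(dim=1)          # pool over horizon
        return self.head(torch.cat([obs_features, tok], dim=-1)) # (B, action_dim)


if __name__ == "__main__":
    obs_features = torch.randn(4, 256)                  # stand-in visual features
    logits = ForwardDynamics()(obs_features)             # (4, 8, 512) motion-token logits
    motion_tokens = logits.argmax(dim=-1)                # (4, 8) predicted motion
    actions = InverseDynamics()(obs_features, motion_tokens)  # (4, 7) robot actions
```

Because the forward model never consumes actions and the inverse model never needs video-scale data, each component can be trained and improved on its own data source.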
Problem

Research questions and friction points this paper is trying to address.

Leveraging action-free videos to overcome scarce labeled robotics data
Decoupling motion prediction from action inference for modular learning
Enhancing policy learning with accurate dynamics from diverse video sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Encodes visual dynamics into discrete motion tokens
Separates motion prediction from action inference
Combines action-free and action-labeled data training
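A hedged sketch of the asymmetric training setup the bullets above imply: the forward model's objective uses only video-derived motion tokens, while the inverse model's objective uses the small action-labeled subset. Tensor shapes, loss choices, and variable names are assumptions for illustration, not the paper's exact objectives.

```python
# Illustrative only: forward dynamics trains on action-free video batches,
# inverse dynamics trains on a small action-labeled batch.
import torch
import torch.nn.functional as F

num_tokens, horizon = 512, 8

# --- forward-dynamics update: action-free video batch ---
fwd_logits = torch.randn(32, horizon, num_tokens, requires_grad=True)  # model output
video_tokens = torch.randint(num_tokens, (32, horizon))                # tokenized keypoint motion
fwd_loss = F.cross_entropy(fwd_logits.reshape(-1, num_tokens), video_tokens.reshape(-1))
fwd_loss.backward()

# --- inverse-dynamics update: small action-labeled batch ---
pred_actions = torch.randn(8, 7, requires_grad=True)                   # model output
expert_actions = torch.randn(8, 7)                                     # labeled robot actions
inv_loss = F.mse_loss(pred_actions, expert_actions)
inv_loss.backward()
```

Because the two objectives draw from disjoint data pools, additional action-free video can keep improving motion prediction without any new robot action labels.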
Jeremy A. Collins
Georgia Tech
Loránd Cheng
Georgia Tech
Kunal Aneja
Georgia Tech
Albert Wilcox
Georgia Tech
Benjamin Joffe
Georgia Tech Research Institute
Animesh Garg
Georgia Institute of Technology, University of Toronto
Robotic Manipulation · Robot Learning · Reinforcement Learning · Machine Learning · Computer Vision