V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fundamental challenge of enabling AI agents to understand and plan in the physical world largely through observational learning. We propose a self-supervised framework that jointly leverages internet-scale video data and a small amount of robot interaction data. Our key contributions are V-JEPA 2, a joint-embedding-predictive architecture pretrained on video and images, and V-JEPA 2-AC, a latent action-conditioned world model post-trained on less than 62 hours of unlabeled robot videos and deployable zero-shot, requiring neither environment interaction nor task-specific reward signals. The framework spans video-image pretraining, alignment with a large language model, and planning from image goals at inference time. V-JEPA 2 achieves strong motion understanding on Something-Something v2 (77.3% top-1 accuracy), state-of-the-art action anticipation on Epic-Kitchens-100 (39.7 recall-at-5), and state-of-the-art video question answering at the 8-billion-parameter scale (84.0 on PerceptionTest). Critically, V-JEPA 2-AC enables zero-shot cross-laboratory deployment, driving Franka arms to execute pick-and-place tasks without any task-specific training or reward.
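
The summary describes prediction in representation space rather than pixel space. As a rough illustration of the joint-embedding-predictive idea, here is a minimal PyTorch-style sketch of one masked latent-prediction step; the module names (`context_encoder`, `target_encoder`, `predictor`), the EMA target update, and the L1 latent loss are assumptions for illustration, not the exact V-JEPA 2 recipe.

```python
import torch
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, video_tokens, mask,
              ema_decay=0.999):
    """One JEPA-style step: predict latent features of masked video tokens
    from the visible ones.
    video_tokens: [B, N, D] tokenized video clip; mask: boolean [N].
    All names and hyperparameters are illustrative."""
    # Encode only the visible (unmasked) tokens with the trainable encoder.
    context = context_encoder(video_tokens[:, ~mask])

    # Targets come from an EMA copy of the encoder, with gradients blocked,
    # so the regression target lives in representation space, not pixel space.
    with torch.no_grad():
        targets = target_encoder(video_tokens)[:, mask]

    # The predictor fills in representations at the masked positions.
    predicted = predictor(context, mask)

    # Latent-space regression loss (L1 here, as one common choice).
    loss = F.l1_loss(predicted, targets)
    loss.backward()  # optimizer step on context_encoder/predictor omitted

    # Slowly move the target encoder toward the online encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_c, alpha=1.0 - ema_decay)
    return loss.item()
```

Because the loss is computed between embeddings rather than pixels, the model can ignore unpredictable low-level detail and focus capacity on scene dynamics.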

📝 Abstract
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
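
The final stage in the abstract, planning with image goals, can be read as energy minimization over candidate action sequences in latent space: roll actions through the action-conditioned predictor and pick the sequence whose predicted future representation lies closest to the goal image's embedding. Below is a minimal sketch of that loop using the cross-entropy method (CEM); the names (`encoder`, `ac_predictor`), the L1 energy, and all hyperparameters are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def plan_with_image_goal(encoder, ac_predictor, current_frame, goal_image,
                         horizon=3, action_dim=7, samples=256, elites=32, iters=5):
    """Pick an action sequence whose predicted latent rollout ends closest to
    the goal image's embedding. Hyperparameters are illustrative."""
    with torch.no_grad():
        state = encoder(current_frame)   # latent state of the current observation
        goal = encoder(goal_image)       # latent embedding of the image goal

        # CEM: iteratively refit a Gaussian over action sequences to the elites.
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)
        for _ in range(iters):
            actions = mean + std * torch.randn(samples, horizon, action_dim)

            # Roll each candidate sequence through the action-conditioned predictor.
            z = state.expand(samples, *state.shape[1:])
            for t in range(horizon):
                z = ac_predictor(z, actions[:, t])

            # Energy: distance between the predicted final latent and the goal latent.
            energy = (z - goal).abs().flatten(1).mean(dim=1)
            elite = actions[energy.topk(elites, largest=False).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0)

    # Return the first action of the best sequence found.
    return mean[0]
```

In a receding-horizon loop the robot would execute only this first action, observe a new frame, and replan, which is how image-goal planning can work without any task-specific reward.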
Problem

Research questions and friction points this paper is trying to address.

Learn to understand the world and to act largely from observation
Combine video data with robot trajectories for prediction
Apply self-supervised learning to robotic planning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning with internet-scale video data
Joint-embedding-predictive architecture V-JEPA 2
Action-conditioned world model V-JEPA 2-AC (see the sketch after this list)
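
The action-conditioned world model in the last bullet can be pictured as post-training a predictor on top of the frozen video encoder, teacher-forced to map the current latent state plus a robot action to the next latent state. A minimal sketch under those assumptions follows; the names `frozen_encoder` and `ac_predictor`, the per-step L1 loss, and the teacher forcing are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def action_conditioned_step(frozen_encoder, ac_predictor, frames, actions):
    """One post-training step on robot video.
    frames:  [B, T, ...] consecutive observations from a robot trajectory.
    actions: [B, T-1, action_dim] actions taken between observations.
    All names and the next-step latent loss are illustrative."""
    T = frames.shape[1]

    # Encode every frame with the frozen, pretrained video encoder.
    with torch.no_grad():
        z = torch.stack([frozen_encoder(frames[:, t]) for t in range(T)], dim=1)

    # Teacher forcing: from the true latent at time t and the action taken,
    # predict the latent at time t + 1. Only the predictor receives gradients.
    pred = torch.stack(
        [ac_predictor(z[:, t], actions[:, t]) for t in range(T - 1)], dim=1
    )
    loss = F.l1_loss(pred, z[:, 1:])
    loss.backward()  # optimizer step on ac_predictor omitted
    return loss.item()
```

Keeping the encoder frozen is what lets a small amount of unlabeled robot video suffice: only the action-conditioned predictor has to be learned on top of representations already trained on web-scale video.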
🔎 Similar Papers
No similar papers found.
Authors

Mahmoud Assran
FAIR at Meta
Adrien Bardes
Research Scientist at Meta, FAIR
Computer Vision, Machine Learning, Deep Learning
David Fan
Meta FAIR Labs
AI, Computer Vision, Deep Learning, Representation Learning
Quentin Garrido
Research scientist, FAIR at Meta
Self-supervised learning
Russell Howes
Facebook
Mojtaba Komeili
FAIR at Meta
Matthew Muckley
Meta Fundamental AI Research
Video Understanding, Machine Learning, Compression, Image Reconstruction, MRI
Ammar Rizvi
FAIR at Meta
Claire Roberts
FAIR at Meta
Koustuv Sinha
Research Scientist, Meta AI (Fundamental AI Research), McGill University (MSc, PhD)
language generation, language reasoning, graph neural networks, systematic generalization
Artem Zholus
Visiting Researcher at Meta; PhD student at MILA
Reinforcement learning, Natural Language Processing, Computer Vision
Sergio Arnaud
Meta AI
AI
Abha Gejji
FAIR at Meta
Ada Martin
Carnegie Mellon University
Machine Learning
Francois Robert Hogan
Research Scientist, Embodied AI, Meta FAIR
Robotics, Reinforcement Learning, Manipulation
Daniel Dugas
PhD, Autonomous Systems Lab, ETH Zurich
Machine learning, Robotics, SLAM
Piotr Bojanowski
Meta FAIR
Computer Vision, Machine Learning
Vasil Khalidov
Meta AI
computer vision, self-supervised learning, generative AI
Patrick Labatut
Meta
Computer Vision, Computer Graphics, Machine Learning
Francisco Massa
Research Engineer at Facebook AI Research
Artificial Intelligence, Computer Vision, Machine Learning
Marc Szafraniec
Research Engineer, Facebook AI Research
Artificial Intelligence, Deep Learning
Kapil Krishnakumar
FAIR at Meta
Yong Li
FAIR at Meta
Xiaodong Ma
Zhejiang Normal University
Science of learning, AI education
Sarath Chandar
Associate Professor, Polytechnique Montreal; Mila; Canada CIFAR AI Chair; Canada Research Chair
Artificial Intelligence, Machine Learning, Deep Learning, Reinforcement Learning, NLP
Franziska Meier
Research Scientist, Facebook AI Research
Machine Learning, Robotics
Yann LeCun
Chief AI Scientist at Facebook & JT Schwarz Professor at the Courant Institute, New York University
AI, machine learning, computer vision, robotics, image compression
Michael Rabbat
Research Scientist, FAIR at Meta
Self-Supervised Learning, Machine Learning, Optimization, Signal Processing, Distributed Computation
Nicolas Ballas
Meta AI Research