V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fundamental challenge of enabling AI agents to understand and plan in the physical world largely through observational learning. We propose a self-supervised framework that jointly leverages internet-scale video data and a small amount of robot interaction data. Our key contributions are V-JEPA 2, a joint-embedding-predictive architecture pretrained on video and images, and V-JEPA 2-AC, a latent action-conditioned world model post-trained on less than 62 hours of unlabeled robot videos and deployable zero-shot, requiring neither environment interaction nor task-specific reward signals. The framework spans video-image pretraining, alignment with a large language model, and planning from image goals at inference time. V-JEPA 2 achieves strong motion understanding on Something-Something v2 (77.3% top-1 accuracy), state-of-the-art action anticipation on Epic-Kitchens-100 (39.7 recall-at-5), and state-of-the-art video question answering at the 8-billion-parameter scale (84.0 on PerceptionTest). Critically, V-JEPA 2-AC enables zero-shot cross-laboratory deployment, driving Franka arms to execute pick-and-place tasks without any task-specific training or reward.
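
The summary describes prediction in representation space rather than pixel space. As a rough illustration of the joint-embedding-predictive idea, here is a minimal PyTorch-style sketch of one masked latent-prediction step; the module names (`context_encoder`, `target_encoder`, `predictor`), the EMA target update, and the L1 latent loss are assumptions for illustration, not the exact V-JEPA 2 recipe.

```python
import torch
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, video_tokens, mask,
              ema_decay=0.999):
    """One JEPA-style step: predict latent features of masked video tokens
    from the visible ones.
    video_tokens: [B, N, D] tokenized video clip; mask: boolean [N].
    All names and hyperparameters are illustrative."""
    # Encode only the visible (unmasked) tokens with the trainable encoder.
    context = context_encoder(video_tokens[:, ~mask])

    # Targets come from an EMA copy of the encoder, with gradients blocked,
    # so the regression target lives in representation space, not pixel space.
    with torch.no_grad():
        targets = target_encoder(video_tokens)[:, mask]

    # The predictor fills in representations at the masked positions.
    predicted = predictor(context, mask)

    # Latent-space regression loss (L1 here, as one common choice).
    loss = F.l1_loss(predicted, targets)
    loss.backward()  # optimizer step on context_encoder/predictor omitted

    # Slowly move the target encoder toward the online encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_c, alpha=1.0 - ema_decay)
    return loss.item()
```

Because the loss is computed between embeddings rather than pixels, the model can ignore unpredictable low-level detail and focus capacity on scene dynamics.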

📝 Abstract
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
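
The final stage in the abstract, planning with image goals, can be read as energy minimization over candidate action sequences in latent space: roll actions through the action-conditioned predictor and pick the sequence whose predicted future representation lies closest to the goal image's embedding. Below is a minimal sketch of that loop using the cross-entropy method (CEM); the names (`encoder`, `ac_predictor`), the L1 energy, and all hyperparameters are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def plan_with_image_goal(encoder, ac_predictor, current_frame, goal_image,
                         horizon=3, action_dim=7, samples=256, elites=32, iters=5):
    """Pick an action sequence whose predicted latent rollout ends closest to
    the goal image's embedding. Hyperparameters are illustrative."""
    with torch.no_grad():
        state = encoder(current_frame)   # latent state of the current observation
        goal = encoder(goal_image)       # latent embedding of the image goal

        # CEM: iteratively refit a Gaussian over action sequences to the elites.
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)
        for _ in range(iters):
            actions = mean + std * torch.randn(samples, horizon, action_dim)

            # Roll each candidate sequence through the action-conditioned predictor.
            z = state.expand(samples, *state.shape[1:])
            for t in range(horizon):
                z = ac_predictor(z, actions[:, t])

            # Energy: distance between the predicted final latent and the goal latent.
            energy = (z - goal).abs().flatten(1).mean(dim=1)
            elite = actions[energy.topk(elites, largest=False).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0)

    # Return the first action of the best sequence found.
    return mean[0]
```

In a receding-horizon loop the robot would execute only this first action, observe a new frame, and replan, which is how image-goal planning can work without any task-specific reward.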
Problem

Research questions and friction points this paper is trying to address.

Learn to understand the world and to act largely from observation
Combine video data with robot trajectories for prediction
Apply self-supervised learning to robotic planning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning with internet-scale video data
Joint-embedding-predictive architecture V-JEPA 2
Action-conditioned world model V-JEPA 2-AC (see the sketch after this list)
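
The action-conditioned world model in the last bullet can be pictured as post-training a predictor on top of the frozen video encoder, teacher-forced to map the current latent state plus a robot action to the next latent state. A minimal sketch under those assumptions follows; the names `frozen_encoder` and `ac_predictor`, the per-step L1 loss, and the teacher forcing are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def action_conditioned_step(frozen_encoder, ac_predictor, frames, actions):
    """One post-training step on robot video.
    frames:  [B, T, ...] consecutive observations from a robot trajectory.
    actions: [B, T-1, action_dim] actions taken between observations.
    All names and the next-step latent loss are illustrative."""
    T = frames.shape[1]

    # Encode every frame with the frozen, pretrained video encoder.
    with torch.no_grad():
        z = torch.stack([frozen_encoder(frames[:, t]) for t in range(T)], dim=1)

    # Teacher forcing: from the true latent at time t and the action taken,
    # predict the latent at time t + 1. Only the predictor receives gradients.
    pred = torch.stack(
        [ac_predictor(z[:, t], actions[:, t]) for t in range(T - 1)], dim=1
    )
    loss = F.l1_loss(pred, z[:, 1:])
    loss.backward()  # optimizer step on ac_predictor omitted
    return loss.item()
```

Keeping the encoder frozen is what lets a small amount of unlabeled robot video suffice: only the action-conditioned predictor has to be learned on top of representations already trained on web-scale video.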
🔎 Similar Papers
No similar papers found.
Authors

Mahmoud Assran
FAIR at Meta
Adrien Bardes
Research Scientist at Meta, FAIR
Computer Vision, Machine Learning, Deep Learning
David Fan
Meta FAIR Labs
AI, Computer Vision, Deep Learning, Representation Learning
Quentin Garrido
Research scientist, FAIR at Meta
Self-supervised learning
Russell Howes
Facebook
Mojtaba Komeili
FAIR at Meta
Matthew Muckley
Meta Fundamental AI Research
Video Understanding, Machine Learning, Compression, Image Reconstruction, MRI
Ammar Rizvi
FAIR at Meta
Claire Roberts
FAIR at Meta
Koustuv Sinha
Research Scientist, Meta AI (Fundamental AI Research), McGill University (MSc, PhD)
language generation, language reasoning, graph neural networks, systematic generalization
Artem Zholus
Visiting Researcher at Meta; PhD student at MILA
Reinforcement learning, Natural Language Processing, Computer Vision
Sergio Arnaud
Meta AI
AI
Abha Gejji
FAIR at Meta
Ada Martin
Carnegie Mellon University
Machine Learning
Francois Robert Hogan
Research Scientist, Embodied AI, Meta FAIR
Robotics, Reinforcement Learning, Manipulation
Daniel Dugas
PhD, Autonomous Systems Lab, ETH Zurich
Machine learning, Robotics, SLAM
Piotr Bojanowski
Meta FAIR
Computer Vision, Machine Learning
Vasil Khalidov
Meta AI
computer vision, self-supervised learning, generative AI
Patrick Labatut
Meta
Computer Vision, Computer Graphics, Machine Learning
Francisco Massa
Research Engineer at Facebook AI Research
Artificial Intelligence, Computer Vision, Machine Learning
Marc Szafraniec
Research Engineer, Facebook AI Research
Artificial Intelligence, Deep Learning
Kapil Krishnakumar
FAIR at Meta
Yong Li
FAIR at Meta
Xiaodong Ma
Zhejiang Normal University
Science of learning, AI education
Sarath Chandar
Associate Professor, Polytechnique Montreal; Mila; Canada CIFAR AI Chair; Canada Research Chair
Artificial Intelligence, Machine Learning, Deep Learning, Reinforcement Learning, NLP
Franziska Meier
Research Scientist, Facebook AI Research
Machine Learning, Robotics
Yann LeCun
Chief AI Scientist at Facebook & JT Schwarz Professor at the Courant Institute, New York University
AI, machine learning, computer vision, robotics, image compression
Michael Rabbat
Research Scientist, FAIR at Meta
Self-Supervised Learning, Machine Learning, Optimization, Signal Processing, Distributed Computation
Nicolas Ballas
Meta AI Research