AI Summary
This work addresses the limitations of traditional generative modeling, which often focuses on pixel-level reconstruction and struggles to capture high-level semantics. To overcome this, the authors propose an Energy-Based Joint Embedding Predictive Architecture (EB-JEPA) that performs self-supervised prediction in representation space rather than pixel space, effectively enabling the construction of world models for images, videos, and action-conditioned environments. The study introduces the first open-source, lightweight, and modular EB-JEPA library, systematically demonstrating the critical role of regularization in preventing representational collapse. The framework supports multi-step temporal prediction and action-conditioned modeling, achieving strong empirical results: 91% probe accuracy on CIFAR-10, high-quality multi-step video prediction on Moving MNIST, and a 97% planning success rate on the Two Rooms navigation task, all trained within hours on a single GPU.
Abstract
We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEPA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.
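To make the core idea concrete, here is a minimal sketch of a JEPA-style objective: a prediction loss computed in representation space (no pixel reconstruction) plus a variance regularizer that discourages representational collapse. This is an illustrative toy in NumPy with linear stand-ins for the learned networks; the function and variable names (`encode`, `jepa_loss`, `var_floor`) are our own assumptions, not the EB-JEPA library's actual API.

```python
# Toy JEPA-style objective: predict the target's embedding from the context's
# embedding, and keep per-dimension batch variance above a floor so the
# encoder cannot collapse to a constant representation.
# NOTE: linear "encoders" and all names here are illustrative assumptions,
# not the EB-JEPA library's API.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy linear encoder standing in for a learned network."""
    return x @ W

def jepa_loss(x_context, x_target, W_enc, W_pred, var_floor=1.0, eps=1e-4):
    z_context = encode(x_context, W_enc)
    z_target = encode(x_target, W_enc)   # in practice, stop-gradient / EMA target
    z_pred = z_context @ W_pred          # predictor head in representation space
    pred_loss = np.mean((z_pred - z_target) ** 2)
    # Variance (anti-collapse) term: hinge on the batch std of each dimension.
    std = np.sqrt(z_context.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, var_floor - std))
    return pred_loss + var_loss

# Toy batch: 8 inputs of 16 features; the "target view" is a perturbed copy.
x = rng.normal(size=(8, 16))
x_target = x + 0.1 * rng.normal(size=(8, 16))
W_enc = 0.1 * rng.normal(size=(16, 4))
W_pred = 0.1 * rng.normal(size=(4, 4))
loss = jepa_loss(x, x_target, W_enc, W_pred)
print(float(loss))
```

In a real training loop both terms would be backpropagated through learned encoder and predictor networks; the variance hinge is one standard regularization choice for preventing collapse, which the paper's ablations identify as critical.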