🤖 AI Summary
This work addresses the instability of existing Joint Embedding Predictive Architecture (JEPA) approaches to end-to-end world model training from raw pixels, which are prone to representation collapse and rely on complex multi-term losses, exponential moving averages, pretrained encoders, or auxiliary supervision to avoid it. To overcome these limitations, we propose LeWorldModel (LeWM), which trains stably end-to-end without pretraining or additional supervision using only two loss terms: a next-embedding prediction loss and a regularizer that enforces Gaussian-distributed latent embeddings. This reduces the number of tunable loss hyperparameters from six to one relative to the only existing end-to-end alternative. With approximately 15 million parameters, the model trains on a single GPU in a few hours. LeWM remains competitive across diverse 2D and 3D control tasks, plans up to 48× faster than foundation-model-based world models, and learns a latent space that encodes physical structure, as shown by probing for physical quantities, and reliably detects physically implausible events.
📝 Abstract
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48x faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
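To make the two-term objective concrete, here is a minimal NumPy sketch of how a prediction loss and a Gaussian regularizer can be combined under a single weight. The abstract does not specify the form of the regularizer, so the moment-matching penalty below (zero mean, identity covariance) is a stand-in assumption, not the paper's method. The names `gaussian_moment_penalty`, `two_term_loss`, and `lam` are hypothetical, as is the choice to regularize the target embeddings.

```python
import numpy as np


def gaussian_moment_penalty(z: np.ndarray) -> float:
    """Penalize deviation of batch statistics from a standard Gaussian.

    ASSUMPTION: a stand-in regularizer that matches only the first two
    moments (zero mean, identity covariance). The paper's regularizer
    may take a different form.
    """
    mean = z.mean(axis=0)                      # per-dimension batch mean
    centered = z - mean
    cov = centered.T @ centered / z.shape[0]   # biased batch covariance
    eye = np.eye(z.shape[1])
    return float(np.sum(mean ** 2) + np.sum((cov - eye) ** 2))


def two_term_loss(z_pred: np.ndarray, z_next: np.ndarray, lam: float = 1.0) -> float:
    """Next-embedding prediction loss plus a weighted Gaussian penalty.

    z_pred: predicted next embeddings, shape (batch, dim)
    z_next: encoder embeddings of the next observation, shape (batch, dim)
    lam:    the single loss weight (hypothetical name)
    """
    prediction = float(np.mean((z_pred - z_next) ** 2))
    regularizer = gaussian_moment_penalty(z_next)
    return prediction + lam * regularizer


# A collapsed encoder (all embeddings identical) predicts perfectly,
# but the regularizer keeps the total loss away from zero.
collapsed = np.zeros((8, 4))
print(two_term_loss(collapsed, collapsed, lam=0.5))
```

In this sketch the collapsed solution has zero prediction error yet still incurs a penalty, which illustrates why a distributional regularizer can discourage representation collapse without exponential moving averages or auxiliary supervision. The single weight `lam` plays the role of the one tunable loss hyperparameter mentioned in the abstract.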